Geoff-Hart.com: Editing, Writing, and Translation


Managing your study data and the supporting documentation. Part 2: Create the necessary project files

By Geoffrey Hart

Previously published as: Hart, G. 2021. Managing your study data and the supporting documentation. Part 2: Create the necessary project files.

In part 1 of this series of articles, I discussed the importance of rigorously documenting your research. In this part, I'll provide more details on how to create and organize your project files.

Note: This series of articles is based on the following book, but with the details modified for non-psychological research: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index.

Project files include official records such as applications for funding, lists of requirements that you must fulfill to satisfy the conditions attached to any funding you receive (e.g., how to cite the funding agency and grant number in the Acknowledgments section of a journal paper), and approvals by your organization’s institutional review board (IRB). In terms of grant requirements, reviewing your project files provides an opportunity to carefully ensure that you performed all the analyses that you promised to perform in your funding proposal. It’s acceptable to do more analyses than you promised, but rarely acceptable to perform fewer analyses.

The Project folder should also include any documents that guided your research, such as questionnaires or data-entry forms and the formal protocols that you designed for administering questionnaires to research subjects or that you used for collecting data in the field or laboratory. If your research group has developed a standard experimental protocol that defines the sequence for recording, storing, and processing data, record that protocol and store it in the Project folder. You can subsequently use these documents as checklists to guide you as you begin to analyze your data. This provides a convenient visual indicator of what processing steps have been completed and what steps must still be done.

You can also include documents that explain any issues with the data, such as the fact that your May 2020 data were recorded before a key instrument was recalibrated whereas the June 2020 data were recorded after recalibration, or the fact that one day of field data was recorded on a cloudy day with little sun but strong winds and the rest of the data were recorded on a calm sunny day. These differences, whether small or large, should not be ignored, since they may affect how you interpret your results.

Also document any problems you solved during analysis, such as your decisions on how to deal with missing data and your criteria for data that should be excluded. For example, particle physicists often use “five-sigma” (five standard deviations) as their criterion for a statistically significant result. If you believe that some data are outliers because they were recorded under inappropriate conditions (e.g., on a cloudy day), record the criteria you used to define them as outliers. The data may still be usable, particularly if they can be combined with data collected under similar conditions, but it may not be safe to combine them with data collected under very different conditions.

If you will be working with human subjects, privacy and confidentiality considerations suggest that you should store all information that would identify specific individuals separately from the data you collected from those individuals. For example, if I am study subject 1 in your research, I should be identified only as subject 1 in all data files. A password-protected file that contains all information on your subjects would, if necessary, provide all my personal data under the name “subject 1”. (For example, if it becomes necessary to stop a vaccine trial because of serious side-effects, the personal data file lets researchers know who they must contact and how to contact them.) This separation of subject identifiers from the data for each subject is important for double-blinded research. It’s also important if you will be providing your data to other researchers for use in (for example) meta-analyses, since you may need to anonymize the data by removing all personal identification before you make your files available.
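The separation described above can also be scripted when you prepare files for sharing. Here is a minimal Python sketch, assuming hypothetical column names (“name”, “email”, “score”) and in-memory records rather than real files:

```python
# Hypothetical example: split combined subject records into a de-identified
# dataset and a separate identifier file (which you would store
# password-protected, apart from the data). Column names are assumptions
# for illustration only.
rows = [
    {"subject": "1", "name": "Geoff Hart", "email": "geoff@example.com", "score": "42"},
    {"subject": "2", "name": "Ann Smith", "email": "ann@example.com", "score": "37"},
]

ID_COLUMNS = {"name", "email"}  # personal identifiers to strip from the data

def split_identifiers(rows, id_columns):
    """Return (anonymized_rows, identifier_rows), both keyed by subject number."""
    anonymized, identifiers = [], []
    for row in rows:
        identifiers.append({k: v for k, v in row.items()
                            if k == "subject" or k in id_columns})
        anonymized.append({k: v for k, v in row.items()
                           if k not in id_columns})
    return anonymized, identifiers

anon, ids = split_identifiers(rows, ID_COLUMNS)
# anon rows keep only "subject" and "score"; ids keep "subject", "name", "email"
```

Because both files retain the subject number, researchers who hold the identifier file can still re-link a subject to their data when necessary (e.g., to halt a vaccine trial).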

Data Files

To emphasize the key characteristic of this folder, I would rename it “Raw data” or “Original data”: all files in this folder represent archival data that should never be changed, not working data (which belongs in the Working Files directory that I describe in the next section). This distinction is essential; if you damage or lose your original copy of the data, you cannot restore it, but if you damage or lose a working file, you can recreate it by starting from the original data. Before you begin working with these files, change their permissions to “read only” to prevent accidental modification of their contents. You can then duplicate each file and rename the copy by adding “working copy”, “uncleaned data”, or some similar label before you begin to clean or analyze the data.


[Figure: Changing a file’s permissions to “read only” to protect it against accidental modification.]


Note that although these file protections are helpful, they are not a substitute for a rigorous backup and archiving strategy. Although your employer’s computer staff should implement such a strategy for you, you can also create your own backups. For some thoughts on how to do this, see "Backing up your data… and other important things" (part 1 and part 2).
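If you prefer to set the read-only protection programmatically rather than through your operating system’s file properties, here is a minimal Python sketch; the folder and file names are invented for illustration, and on Windows, clearing the write bits sets the file’s read-only attribute:

```python
import os
import stat
import tempfile

def make_read_only(folder):
    """Strip write permission from every file in a folder of original data."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            mode = os.stat(path).st_mode
            # Clear the owner, group, and other write bits.
            os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

# Demonstration on a throwaway folder standing in for "Raw data":
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "may2020.csv"), "w") as f:
    f.write("raw data\n")
make_read_only(folder)
```

After this runs, attempts to open “may2020.csv” for writing will fail until you deliberately restore write permission.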

Spend a few moments thinking about how you will analyze your data so that you can store this original data in logical groups. For example, if you performed a longitudinal study, create subfolders with names based on the dates when you obtained the data. If your study is conducted at multiple locations, create subfolders with names that are based on the locations, such as “Field” and “Laboratory”. Subfolders within those folders could be named based on the data type (e.g., video records vs. chemical analyses) or the recording method (e.g., recorded using a datalogger vs. manually recorded).
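Here is a small Python sketch of creating such a folder hierarchy in one step, using hypothetical location and data-type names that you would adjust to match your own study:

```python
from pathlib import Path
import tempfile

# Hypothetical layout: locations at the top level, data types below them.
LAYOUT = {
    "Field": ["Video records", "Chemical analyses"],
    "Laboratory": ["Datalogger", "Manual records"],
}

def create_data_folders(root, layout):
    """Create a Raw data folder tree: location subfolders, then data types."""
    for location, data_types in layout.items():
        for data_type in data_types:
            (Path(root) / "Raw data" / location / data_type).mkdir(
                parents=True, exist_ok=True)

# Demonstration in a throwaway directory standing in for your project root:
root = tempfile.mkdtemp()
create_data_folders(root, LAYOUT)
```

Creating the whole tree up front, before data collection begins, makes it less likely that files will be dropped into the wrong place under deadline pressure.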

The Data Files folder should also include a document that clearly explains your conventions for naming the variables you measured. This is particularly important if the instruments you use to obtain a measurement assign cryptic variable names that only a machine could love. Create a table that provides the machine-assigned names in one column and the human-friendly names in a second column. This is also important if you are working in one language (e.g., Japanese) but writing papers in another language (e.g., English), since the way you choose variable names will differ between languages. For example, English variable names are often based on the first letter of each word (e.g., Net Primary Production = NPP).
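Such a lookup table can also live directly in your analysis code. Here is a minimal Python sketch, with invented machine-assigned names and units:

```python
# Hypothetical machine-assigned names mapped to human-friendly names.
NAME_MAP = {
    "V001": "net_primary_production",   # NPP, g C per m2 per year
    "TMP_A": "air_temperature",         # degrees C
    "RH_01": "relative_humidity",       # percent
}

def rename_variables(record, name_map):
    """Return a copy of one data record with human-friendly variable names.
    Names missing from the map are kept as-is so nothing is silently lost."""
    return {name_map.get(key, key): value for key, value in record.items()}

row = {"V001": 812.5, "TMP_A": 21.3, "UNKNOWN": 1}
renamed = rename_variables(row, NAME_MAP)
# renamed: {"net_primary_production": 812.5, "air_temperature": 21.3, "UNKNOWN": 1}
```

Keeping unmapped names unchanged, rather than discarding them, makes it obvious when the lookup table has fallen behind the instrument’s output.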

When you choose variable names, ensure that they reflect the meaning of the data. For example, some variables are based on binary logic in which 1 = yes and 0 = no. Thus, your variable name should include the word “treated” if 1 = treated and “untreated” if 1 = not treated. Similarly, if you use a variable named “response strength” with values ranging from 1 to 10, use 1 (a low number) for a low strength and 10 (a high number) for a high strength. This may seem obvious, but I’ve corrected serious errors in some of the papers I’ve edited that resulted from authors forgetting the meaning of their variables and reaching incorrect conclusions. If you accidentally chose a confusing variable name, choose a clearer name and recode your data so that it agrees with that name. Name these variables systematically (e.g., add “recoded” to the new variable name) so that when you review your data, it will be clear whether you’re using the raw data or the recoded data.
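Here is a minimal Python sketch of this recoding convention, with invented variable names and values; note that the original variable is kept alongside the recoded one:

```python
# Hypothetical example: suppose "treated" was accidentally coded
# 1 = untreated, 0 = treated. Rather than overwriting the original,
# create a clearly labeled recoded variable with the intuitive coding.
subjects = [
    {"subject": 1, "treated": 0},   # actually treated
    {"subject": 2, "treated": 1},   # actually untreated
]

for row in subjects:
    # Flip the coding so that 1 = treated in the recoded variable.
    row["treated_recoded"] = 1 - row["treated"]

# Both codings survive, so anyone reviewing the data can see which is which.
```

Adding the “recoded” suffix rather than reusing the old name means an analysis script can never silently mix the two codings.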

Working Files

Working files are files that you will continue changing until your analysis is complete. The essential purpose of the Working Files folder is to ensure that you are not working on your original or cleaned data files; if those files are damaged, it may be impossible, or at least very time-consuming, to recover them. (If you follow a rigorous backup routine, recovery of those original files will be easier.)

After each analytical step that requires a significant amount of work, consider storing the resulting file in your Data Files folder so that you can retrieve that version of your data if you need to repeat your analysis from that point to correct a problem. As with the original data, create a copy of the cleaned data and work on that copy.

Note: Cleaning files involves removing obvious errors such as typing errors, extreme outliers, and data that you collected under inappropriate conditions (e.g., during a rainstorm for an instrument that requires dry air). I’ll discuss this more in Part 3 of this article.

Reviewing your data to detect problems should be done as soon as possible after you collect the data. For example, in social science or psychology research, you should confirm that the subject has not skipped any questions before you finish the interview and let them leave. You can then ask them to provide a response, or ask them to clarify the meaning of an ambiguous response to an open-ended question. If you’re working in a laboratory, quickly review your data to ensure that there are no obvious problems; if there are, you may be able to fix the problem (e.g., recalibrate the instrument) and immediately repeat the measurement; it may be impossible to repeat the measurements several days later. This is particularly important during field research at remote locations. It’s sometimes possible to remain in the field for an extra day to repeat your data collection, but it may be impossible to return to a remote site if you only discover a problem after you return home.

One of the first steps in cleaning your files is to proofread the file to detect data-entry errors. One common approach for information that was recorded on paper (e.g., a written questionnaire) is to have one person read the data from the paper while another person checks it against the data that has been entered in a database or spreadsheet. You can also partially automate this by having two people enter the same dataset into separate computer files, although this may be impractical for large datasets. For example, you could create an Excel file named “Geoff’s input data” and a second file named “Matt’s input data”. You can then create a third file that contains the results of subtracting each value in Matt’s data from the corresponding value in Geoff’s data. If the values are identical, the result of this operation will be 0. If the result is any other value, this indicates a data-entry error, such as a missing row of data. (This approach also makes it easy to find the missing data, since the cell where the difference changes from 0 to some other value corresponds to the position of the error in the original data.)
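The subtraction check described above can be sketched in a few lines of Python; the names and values below are invented for illustration:

```python
# Double-entry check: subtract Matt's values from Geoff's and report the
# first row where the difference is nonzero. Data values are hypothetical.
geoff = [12.1, 13.4, 15.0, 16.2, 17.8]
matt  = [12.1, 13.4, 15.0, 16.3, 17.8]   # one typing error in row 4

def first_mismatch(entry_a, entry_b):
    """Return the 1-based row of the first nonzero difference, or None."""
    for row, (a, b) in enumerate(zip(entry_a, entry_b), start=1):
        if a - b != 0:
            return row
    if len(entry_a) != len(entry_b):     # e.g., a missing row of data
        return min(len(entry_a), len(entry_b)) + 1
    return None

first_mismatch(geoff, matt)  # returns 4: the two entries disagree at row 4
```

A spreadsheet column of `=A2-B2` formulas accomplishes the same thing; the point in both cases is that the first nonzero difference pinpoints where the two entries diverge.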

For larger datasets, and particularly for machine-recorded data, you can set your spreadsheet or statistical software to highlight values that lie outside the expected or permitted range of values. For example, if you measured the length of a plant part to a precision of 0.1 mm, values <0.1 mm represent errors. (That is, any plant part that can be measured has a non-zero length.)
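Here is a minimal Python sketch of such a range check, with invented measurement values:

```python
# Flag plant-part lengths below the 0.1 mm measurement precision as
# suspect values that need investigation. The data are hypothetical.
lengths_mm = [2.3, 0.0, 4.7, -1.2, 3.8]

MIN_VALID = 0.1  # anything measurable has a length of at least 0.1 mm

flagged = [(row, value) for row, value in enumerate(lengths_mm, start=1)
           if value < MIN_VALID]
# flagged: [(2, 0.0), (4, -1.2)] -- rows 2 and 4 need investigation
```

Recording the row number along with the offending value makes it easy to trace each suspect measurement back to the original data sheet or instrument log.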

In Part 3 of this article, I’ll discuss data validation and some software techniques you can use to process your data.


©2004–2024 Geoffrey Hart. All rights reserved.