Data Preparation: Editing, Coding, and Cleansing

Data Editing

Information gathered during data collection may lack uniformity. For example, data collected through questionnaires and schedules may have answers that are not marked in the proper places, or some questions may be left unanswered. Sometimes, information may be given in a form that needs reconstruction into a category designed for analysis, such as converting daily or monthly income into annual income. The researcher has to decide how to edit it.
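The income example above can be sketched in a few lines of Python. This is only an illustration: the function name `to_annual_income`, the period labels, and the multipliers (365 days, 52 weeks, 12 months per year) are assumptions, and a researcher may reasonably choose others, such as counting working days only.

```python
# Hypothetical multipliers; a study may prefer working days over calendar days.
PERIODS_PER_YEAR = {"daily": 365, "weekly": 52, "monthly": 12, "annual": 1}


def to_annual_income(amount, period):
    """Convert an income reported per day, week, or month into an annual figure."""
    try:
        return amount * PERIODS_PER_YEAR[period]
    except KeyError:
        raise ValueError(f"unknown reporting period: {period!r}")


# Each response is (amount, reporting period) as written on the schedule.
responses = [(500, "daily"), (15000, "monthly"), (180000, "annual")]
annual = [to_annual_income(amount, period) for amount, period in responses]
print(annual)
```

Raising an error for an unrecognized period, rather than guessing, leaves the editing decision with the researcher.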

Editing also ensures that the data are relevant and appropriate and that errors are corrected. Occasionally, the investigator makes a mistake and records an impossible answer. For instance, the answer to the question “How much red chili do you use in a month?” is written as “4 kilos.” Can a family of three members use four kilos of chilies in a month? The correct answer could be “0.4 kilos.”
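A simple plausibility check can help the editor spot such answers. The sketch below is illustrative only: the record fields, the function name `flag_implausible`, and the ceiling of 0.5 kilos per person per month are assumptions, not values from any study. It flags suspicious records for review rather than changing them, because the editor cannot know from the number alone what the respondent actually said.

```python
# Hypothetical ceiling on monthly chili use per household member, in kilos.
MAX_KG_PER_PERSON = 0.5


def flag_implausible(records, max_per_person=MAX_KG_PER_PERSON):
    """Return the ids of records whose reported quantity exceeds the ceiling."""
    flagged = []
    for rec in records:
        if rec["chili_kg"] > max_per_person * rec["members"]:
            flagged.append(rec["id"])
    return flagged


records = [
    {"id": 1, "members": 3, "chili_kg": 4.0},  # the doubtful answer from the text
    {"id": 2, "members": 3, "chili_kg": 0.4},
    {"id": 3, "members": 6, "chili_kg": 1.2},
]
print(flag_implausible(records))
```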

Care should be taken in editing (rearranging) answers to open-ended questions. For example, a “don’t know” answer is sometimes edited as “no response.” This is wrong, because the two answers mean different things. “Don’t know” means that the respondent is not sure, is of two minds about their reaction, or considers the question personal and does not want to answer it. “No response” means that the respondent is not familiar with the situation, object, event, or individual about which they are asked.

Data Coding

Coding is the translation of answers into numerical values, that is, the assignment of numbers to the various categories of a variable for use in data analysis. Coding is done with a codebook, a code sheet, and a computer card, and it follows the instructions given in the codebook. The codebook gives the numerical codes for each variable.
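A codebook can be pictured as a lookup table from answers to numbers. The following sketch is a hypothetical illustration: the variables, categories, and code numbers (including 8 and 9 for the two kinds of non-substantive answer) are invented for the example. It gives “don’t know” and “no response” separate codes so that the two are not merged during coding.

```python
# Hypothetical codebook: one mapping of answer category to numeric code per variable.
CODEBOOK = {
    "marital_status": {
        "single": 1, "married": 2, "widowed": 3,
        "don't know": 8, "no response": 9,
    },
    "owns_house": {
        "yes": 1, "no": 2,
        "don't know": 8, "no response": 9,
    },
}


def code_answer(variable, answer):
    """Look up the numeric code for an answer, ignoring case and extra spaces."""
    key = " ".join(answer.strip().lower().split())
    return CODEBOOK[variable][key]


raw = ["Yes", " no ", "Don't know", "No response"]
coded = [code_answer("owns_house", answer) for answer in raw]
print(coded)
```

An answer that is missing from the codebook raises a `KeyError`, which signals that the codebook needs a new category rather than silently assigning a wrong code.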

Nowadays, codes are assigned before going to the field while constructing the questionnaire or schedule. After data collection, pre-coded items are fed to the computer for processing and analysis. For open-ended questions, however, post-coding is necessary. In such cases, all answers to open-ended questions are placed in categories, and each category is assigned a code.
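Post-coding can be sketched as follows. In practice a human coder usually reads the answers and builds the categories; the keyword matching used here is a simplified stand-in for that judgment, and the categories, keywords, and code numbers are all hypothetical.

```python
# Hypothetical categories built after reading the open-ended answers.
CATEGORY_KEYWORDS = {
    1: ("price", "cost", "expensive"),  # 1 = cost concerns
    2: ("quality", "taste"),            # 2 = quality concerns
}
OTHER = 99  # catch-all code for answers that fit no category


def post_code(answer):
    """Assign the first category whose keywords appear in the answer."""
    text = answer.lower()
    for code, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return code
    return OTHER


answers = ["Too expensive for us", "The taste is poor", "Shop is far away"]
codes = [post_code(answer) for answer in answers]
print(codes)
```

If many answers fall into the catch-all code, that suggests the category scheme should be revised.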

Manual processing is employed when qualitative methods are used, when a small sample is used in quantitative studies, when the questionnaire or schedule has a large number of open-ended questions, or when access to computers is difficult or their use is inappropriate. Coding is still required, however, when data are processed manually.

Data Cleansing

Data cleansing, a term used mainly in connection with databases, refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.

Data cleansing can occur within a single set of records or between multiple sets of data that need to be merged or that will work together. Typos and spelling errors are corrected, mislabeled data is properly labeled and filed, and incomplete data is addressed.
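Both cases, cleansing within one record set and between two sets that are merged, can be sketched together. The example below is a minimal illustration under stated assumptions: the field names, the spelling-correction table, and the rule that a missing value in the first set is filled from the second are all hypothetical choices.

```python
# Hypothetical table of known misspellings and their corrected forms.
SPELLING_FIXES = {"dehli": "Delhi", "delhi": "Delhi",
                  "mumbay": "Mumbai", "mumbai": "Mumbai"}


def clean_record(rec):
    """Return a copy of the record with the city name corrected where known."""
    fixed = dict(rec)
    city = rec.get("city")
    if city is not None:
        fixed["city"] = SPELLING_FIXES.get(city.strip().lower(), city.strip())
    return fixed


def merge_sets(primary, secondary):
    """Merge two record sets by 'id', filling missing primary fields from secondary."""
    merged = {rec["id"]: clean_record(rec) for rec in primary}
    for rec in map(clean_record, secondary):
        target = merged.setdefault(rec["id"], {})
        for field, value in rec.items():
            if target.get(field) is None:
                target[field] = value
    return [merged[key] for key in sorted(merged)]


set_a = [{"id": 1, "city": "Dehli", "age": None},
         {"id": 2, "city": "Mumbai", "age": 41}]
set_b = [{"id": 1, "city": "Delhi", "age": 34},
         {"id": 3, "city": "mumbay", "age": 29}]
result = merge_sets(set_a, set_b)
print(result)
```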

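Taken as a whole, data preparation involves the following steps:

  • Questionnaire Checking: Questionnaire checking involves eliminating unacceptable questionnaires. A questionnaire may be unacceptable because it is incomplete, because instructions were not followed, because the answers show little variance, because pages are missing, because it was received after the cutoff date, or because the respondent is not qualified.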
  • Editing: Editing aims to correct illegible, incomplete, inconsistent, and ambiguous answers.
  • Coding: Coding typically assigns alphabetic or numeric codes to answers that do not already have them so that statistical techniques can be applied.
  • Transcribing: Transcribing data involves transferring data to make it accessible to people or applications for further processing.
  • Cleaning: Cleaning reviews data for consistency. Inconsistencies may arise from faulty logic or from out-of-range or extreme values.
  • Statistical Adjustments: Statistical adjustments, such as weighting and scale transformations, are applied to data that require them.
  • Analysis Strategy Selection: Finally, the selection of a data analysis strategy is based on earlier work in designing the research project but is finalized after considering the characteristics of the data that has been gathered.
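The cleaning and statistical adjustment steps listed above can be sketched as follows. This is an illustration only: the field names, the 1 to 5 satisfaction scale, the car-ownership logic check, and the weights are hypothetical, and dropping inconsistent records is just one of several possible treatments.

```python
def find_inconsistencies(rec):
    """List the problems found in one record (an empty list means it is consistent)."""
    problems = []
    if not 1 <= rec["satisfaction"] <= 5:  # out-of-range value on a 1-5 scale
        problems.append("satisfaction out of range")
    if rec["owns_car"] == "no" and rec["cars_owned"] > 0:  # faulty logic
        problems.append("car ownership answers conflict")
    return problems


def weighted_mean(values, weights):
    """Mean of the values, with each one counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


records = [
    {"satisfaction": 4, "owns_car": "yes", "cars_owned": 1, "weight": 1.0},
    {"satisfaction": 7, "owns_car": "no", "cars_owned": 0, "weight": 1.0},
    {"satisfaction": 2, "owns_car": "no", "cars_owned": 2, "weight": 1.0},
    {"satisfaction": 2, "owns_car": "no", "cars_owned": 0, "weight": 3.0},
]

# Keep only the records that pass every check, then apply the weights.
clean = [rec for rec in records if not find_inconsistencies(rec)]
mean = weighted_mean([rec["satisfaction"] for rec in clean],
                     [rec["weight"] for rec in clean])
print(len(clean), mean)
```

Here the second and third records fail the checks, so the weighted mean is computed from the remaining two.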