Resident Forum
2025:5;140
doi: 10.25259/CSDM_197_2025

Missing data in medical research: A practical guide to identification and management

Department of Preventive and Social Medicine, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India
Department of Community Medicine, Sri Manakula Vinayagar Medical College and Hospital, Puducherry, India.
*Corresponding author: Mahalakshmy Thulasingam, Department of Preventive and Social Medicine, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India. mahalakshmi.dr@gmail.com

Licence
This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-Share Alike 4.0 License, which allows others to remix, transform, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

How to cite this article: Thulasingam M, Bhengra MS. Missing data in medical research: A practical guide to identification and management. CosmoDerma. 2025;5:140. doi: 10.25259/CSDM_197_2025

INTRODUCTION

Medical research is vital for disseminating evidence-based knowledge, enabling clinicians to make informed decisions and improve individual patient care.[1] Medical data include a wide range of information, such as a patient’s demographic details, medical history, investigations, diagnoses, treatments, and outcomes.[2] These datasets are highly useful for scientific progress. However, medical research frequently encounters missing data due to data entry errors, incomplete questionnaires, loss to follow-up, participant dropout, selective responses, and disruptions in electronic health records. Missing values reduce the effective sample size and statistical power and can bias statistical analyses, affecting the accuracy and validity of the results. Identifying the type and pattern of missing data helps researchers choose appropriate statistical methods for managing them. This article aims to help clinicians and early-career academics apply these techniques effectively in medical research.

MISSING DATA IDENTIFICATION

During data collection, three primary categories of missing data are observed: “Missing Completely at Random (MCAR),” “Missing at Random (MAR),” and “Not Missing at Random (NMAR).” Researchers must first hypothesize the nature of the missingness in the dataset and apply the appropriate statistical methods, even though no formal test is available to accurately validate these assumptions.

  1. Missing completely at random (MCAR) refers to missing data that occur in a purely random pattern. The probability that a value is missing is the same for every observation and is unrelated to any observed or unobserved variable. However, this assumption is rarely satisfied.[3,4] A common example is a change of observer, when someone else performs clinical assessments in the absence of the primary observer. Here, missing entries occur randomly and are not related to the patient’s characteristics, disease severity, or treatment outcomes. Occasionally, transport barriers and adverse weather conditions may also prevent patients from attending follow-up visits.

  2. Missing at random (MAR) refers to a situation in which the probability of a missing value depends on observed data, so the known data can be used to predict missingness. Thus, the likelihood of a missing value is related to the patient characteristics measured in the clinical study. MAR is commonly observed in medical research studies.[3,4] For example, the dermatology life quality index is a patient-reported questionnaire, and patients with low literacy tend to leave more questions unanswered. Likewise, some patients prescribed steroid medications may show poor adherence due to fear of side effects. Some female patients may also refuse a complete examination of skin lesions due to privacy concerns.

  3. Not missing at random (NMAR) describes scenarios in which missingness depends on unobserved variables, including the missing value itself. The observed data alone cannot predict missingness.[3,4] For example, some patients with diabetes and recurrent candidiasis may miss follow-up appointments because they have a negative attitude toward taking their medications as prescribed, which can result in high blood sugar levels. Furthermore, the Urticaria Activity Score (UAS7) is used to track disease activity or treatment response in patients with urticaria. However, patients may skip recording daily logs because of intense itching and discomfort, causing irregularities in the clinical review.

Another classification of missing data is recoverable and non-recoverable data. Recoverable data refer to cases where missingness occurs randomly and is typically due to unidentifiable reasons. In contrast, non-recoverable data occur when missingness is explained by available data or an unobserved value, resulting in more biased data.[5]
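The distinction between MCAR and MAR can be explored, though not proven, by comparing how often a variable is missing across groups defined by an observed variable. The sketch below uses invented records (literacy level and a DLQI total score) purely for illustration; the data and the function name `missing_rate_by_group` are assumptions and do not come from a real study.

```python
# Hypothetical records: (literacy level, DLQI total score); None marks a missing score.
records = [
    ("low", None), ("low", 12), ("low", None), ("low", 9),
    ("high", 7), ("high", 10), ("high", None), ("high", 5),
    ("high", 8), ("high", 6),
]


def missing_rate_by_group(rows):
    """Return the proportion of missing values within each observed group."""
    totals, missing = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (value is None)
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group} literacy: {rate:.0%} of DLQI scores missing")
```

A clear difference between groups is compatible with MAR rather than MCAR, but it cannot rule out NMAR, because NMAR depends on values that were never observed.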

METHODS FOR HANDLING MISSING DATA

Complete case analysis

This straightforward approach eliminates all records with missing values and assumes that the remaining data represent the entire study population, which holds only if missingness is MCAR. However, this method reduces statistical power and may introduce bias if the MCAR assumption is violated.[3-6] For example, in the scenario shown in Table 1, the records containing a missing value (marked ____) are discarded and not included in the analysis.

Table 1: Complete case analysis method for missing data.
Age    Gender    Diabetes    Candidiasis
42     Male      Yes         Yes
55     Female    ____        No
38     Male      Yes         No
46     Male      Yes         ____
60     Female    Yes         Yes
72     Male      ____        No
68     Female    Yes         No
____   Male      No          Yes
45     Female    Yes         Yes
35     Female    No          Yes
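As a minimal sketch, complete case analysis of the Table 1 records can be written in plain Python as follows. The variable and function names are illustrative assumptions, and `None` stands in for the blank cells.

```python
# Rows of Table 1 as (age, gender, diabetes, candidiasis); None marks a missing value.
TABLE_1 = [
    (42, "Male", "Yes", "Yes"),
    (55, "Female", None, "No"),
    (38, "Male", "Yes", "No"),
    (46, "Male", "Yes", None),
    (60, "Female", "Yes", "Yes"),
    (72, "Male", None, "No"),
    (68, "Female", "Yes", "No"),
    (None, "Male", "No", "Yes"),
    (45, "Female", "Yes", "Yes"),
    (35, "Female", "No", "Yes"),
]


def complete_cases(rows):
    """Keep only the records in which every variable is observed."""
    return [row for row in rows if all(value is not None for value in row)]


kept = complete_cases(TABLE_1)
print(f"{len(kept)} of {len(TABLE_1)} records retained")
```

Four of the ten records are lost even though only four individual cells are missing, which illustrates how quickly this method erodes the sample size.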

Available case analysis

This technique removes only observations with missing data for the specific variable being analyzed. Although it preserves more data than a complete case analysis, comparisons across analyses are difficult because of the varying sample sizes and compositions [Table 2].[3-6] Here, if individual variables are to be separately analyzed, then only missing observations (highlighted cells) are discarded.

Table 2: Available case analysis method for missing data.
Age    Gender    Diabetes    Candidiasis
42     Male      Yes         Yes
55     Female    ____        No
38     Male      Yes         No
46     Male      Yes         ____
60     ____      Yes         Yes
72     Male      ____        No
68     Female    Yes         No
____   Male      No          Yes
45     Female    Yes         ____
35     Female    No          Yes
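The varying sample sizes can be made concrete with a short sketch. For comparability with complete case analysis, it reuses the records of Table 1 rather than Table 2; the helper `available` and the column names are illustrative assumptions.

```python
COLUMNS = ("age", "gender", "diabetes", "candidiasis")

# Rows of Table 1; None marks a missing value.
TABLE_1 = [
    (42, "Male", "Yes", "Yes"),
    (55, "Female", None, "No"),
    (38, "Male", "Yes", "No"),
    (46, "Male", "Yes", None),
    (60, "Female", "Yes", "Yes"),
    (72, "Male", None, "No"),
    (68, "Female", "Yes", "No"),
    (None, "Male", "No", "Yes"),
    (45, "Female", "Yes", "Yes"),
    (35, "Female", "No", "Yes"),
]


def available(rows, *names):
    """Return the values of the named variables for every record
    in which all of those variables are observed."""
    indexes = [COLUMNS.index(name) for name in names]
    return [
        tuple(row[i] for i in indexes)
        for row in rows
        if all(row[i] is not None for i in indexes)
    ]


# Each analysis uses a different subset of patients.
sample_sizes = {name: len(available(TABLE_1, name)) for name in COLUMNS}
pair_size = len(available(TABLE_1, "diabetes", "candidiasis"))
print(sample_sizes)
print("diabetes x candidiasis:", pair_size)
```

Each univariate summary and each pairwise comparison rests on a different set of patients, which is why results from different available case analyses are hard to compare.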

Mean/median substitution

This method substitutes all missing values with the mean or median of the available observations and is therefore easy and fast. If the dataset contains outliers, the median is the more reliable choice. However, substitution underestimates variability compared to the deletion method.[3-6] The ages listed in Table 1 are 42, 55, 38, 46, 60, 72, 68, missing value, 45, and 35. The mean of the nine observed ages (461/9, approximately 51 years) is imputed for the missing value.
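The arithmetic above, and the loss of variability that follows from it, can be checked with a short sketch using Python's standard `statistics` module. The ages are those of Table 1; the variable names are illustrative.

```python
import statistics

# Ages from Table 1; None marks the missing value.
ages = [42, 55, 38, 46, 60, 72, 68, None, 45, 35]
observed = [age for age in ages if age is not None]

mean_age = statistics.fmean(observed)     # 461 / 9
median_age = statistics.median(observed)  # middle of the nine sorted ages

# Replace the missing age with the mean of the observed ages.
imputed = [mean_age if age is None else age for age in ages]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
print(f"mean = {mean_age:.1f}, median = {median_age}")
print(f"SD before imputation = {sd_observed:.2f}, after = {sd_imputed:.2f}")
```

The imputed value sits exactly at the mean and adds nothing to the spread, so the standard deviation of the completed data is smaller than that of the observed data.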

Linear interpolation

This is mostly applied to time-series data. For example, the Visual Analog Scale (VAS) is used to measure the intensity of symptoms such as itching and burning. If the VAS observation is missing at the 4th week, it is estimated by interpolating between the observations from the previous and subsequent weeks (i.e., the 3rd and 5th weeks).[5] Here, the red-marked point indicates the substituted value at week 4 [Figure 1].

Figure 1: Linear interpolation method for missing data.
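A minimal sketch of the interpolation step is given below. The weekly VAS scores are invented for illustration, and `interpolate_gaps` is a hypothetical helper, not a library function.

```python
def interpolate_gaps(values):
    """Fill interior None entries by linear interpolation between the
    nearest observed neighbours; leading or trailing gaps are left as None."""
    filled = list(values)
    observed = [i for i, value in enumerate(values) if value is not None]
    for left, right in zip(observed, observed[1:]):
        for i in range(left + 1, right):
            fraction = (i - left) / (right - left)
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


# Hypothetical VAS itch scores for weeks 1 to 6; week 4 was not recorded.
vas = [8, 7, 6, None, 4, 3]
vas_filled = interpolate_gaps(vas)
print(vas_filled)
```

With equally spaced visits, the week 4 value is simply the midpoint of the week 3 and week 5 scores.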

Hot deck/cold deck imputation

Hot deck imputation replaces missing values with data from similar records within the same dataset. In contrast, cold deck imputation replaces missing values with data from external sources or historical records. Both strategies tend to preserve authentic data distributions but depend heavily on the quality and relevance of the donor data.[4,5] For example, a patient’s missing age can be replaced either by the age of a similar case in the dataset (hot deck) or by the average age taken from National Health Survey data (cold deck).
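The following sketch illustrates both ideas on the age and gender columns of Table 1. It assumes, for illustration only, that "similar" means "same gender" and that the donor is drawn at random; `EXTERNAL_MEAN_AGE` is a made-up placeholder, not a real survey figure.

```python
import random

# Age and gender from Table 1; None marks the missing age.
records = [
    {"age": 42, "gender": "Male"}, {"age": 55, "gender": "Female"},
    {"age": 38, "gender": "Male"}, {"age": 46, "gender": "Male"},
    {"age": 60, "gender": "Female"}, {"age": 72, "gender": "Male"},
    {"age": 68, "gender": "Female"}, {"age": None, "gender": "Male"},
    {"age": 45, "gender": "Female"}, {"age": 35, "gender": "Female"},
]

EXTERNAL_MEAN_AGE = 50  # hypothetical external reference value (cold deck)


def hot_deck_age(rows, rng):
    """Replace each missing age with the age of a randomly chosen donor
    of the same gender from the same dataset."""
    completed = []
    for row in rows:
        if row["age"] is None:
            donors = [r["age"] for r in rows
                      if r["gender"] == row["gender"] and r["age"] is not None]
            row = {**row, "age": rng.choice(donors)}  # fails if no donor exists
        completed.append(row)
    return completed


def cold_deck_age(rows, external_value):
    """Replace each missing age with a value taken from an external source."""
    return [{**r, "age": external_value} if r["age"] is None else r for r in rows]


hot = hot_deck_age(records, random.Random(0))
cold = cold_deck_age(records, EXTERNAL_MEAN_AGE)
print("hot deck:", hot[7]["age"], "| cold deck:", cold[7]["age"])
```

In practice, donors are usually matched on several characteristics rather than one, which is where the quality of the donor pool becomes decisive.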

Last observation carried forward

This strategy addresses the absence of data in longitudinal investigations by carrying forward the most recent measurement to replace missing values. This is commonly used in dermatological clinical trials, where patients often miss follow-up visits due to severe symptoms or long treatment durations. However, disease status is assumed to remain stable from the last visit, which rarely occurs in practice, resulting in bias.[4,5]
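As a minimal sketch, last observation carried forward for one patient's visit series can be written as below. The scores are invented for illustration, and `locf` is a hypothetical helper name.

```python
def locf(values):
    """Carry the most recent observed value forward into later gaps.
    Gaps before the first observation stay None."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical severity scores at five visits; visits 3 and 4 were missed.
scores = [8, 6, None, None, 3]
scores_filled = locf(scores)
print(scores_filled)
```

In this invented series the patient improved from 6 to 3, yet the two filled visits are held at 6, which shows how the stability assumption can misrepresent the disease course.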

Regression imputation

Regression models predict the missing data points based on the observed variables. This preserves the relationships between variables but may create artificial correlations, potentially biasing the model, particularly when missingness patterns vary across multiple variables.[3-6] For example, the missing VAS observation for itching at the 4th week of an 8-week study is predicted by software that fits a regression model to observed values such as the baseline VAS score, age, and gender.
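The sketch below shows the idea with a single predictor (baseline VAS) instead of the several predictors mentioned above, which is a simplification made for brevity. The scores are invented, and the least-squares line is fitted by hand so that the example needs no third-party packages.

```python
import statistics

# Hypothetical data: baseline VAS and week-4 VAS; None marks a missing week-4 score.
baseline = [8, 6, 9, 5, 7]
week4 = [5, 4, 6, 3, None]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Fit the model on patients with both scores observed.
pairs = [(x, y) for x, y in zip(baseline, week4) if y is not None]
intercept, slope = fit_line([x for x, _ in pairs], [y for _, y in pairs])

# Replace each missing week-4 score with its predicted value.
week4_imputed = [
    intercept + slope * x if y is None else y
    for x, y in zip(baseline, week4)
]
print(week4_imputed)
```

Every imputed value lies exactly on the fitted line, which is how this approach can make the relationship between variables look stronger than it really is.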

Multiple imputation

Multiple imputation generates several datasets (typically five to ten) by replacing missing data with plausible values derived from predictive modeling. When VAS readings are missing, software generates multiple plausible values based on each patient’s characteristics, analyzes each completed dataset separately, and combines the results into a pooled estimate. This provides a comprehensive and robust result.[3-6]
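The impute, analyze, and pool cycle can be sketched as follows. This is a deliberately simplified illustration on invented data: missing week-4 scores are drawn from a regression prediction plus random noise, and the results are combined with Rubin's rules. A full implementation would also redraw the regression coefficients for each imputation, so this sketch somewhat understates the uncertainty.

```python
import random
import statistics

# Hypothetical data: baseline VAS and week-4 VAS; None marks a missing week-4 score.
baseline = [8, 6, 9, 5, 7, 4, 8, 6]
week4 = [5, 4, 6, 3, None, 2, None, 4]


def fit_line(xs, ys):
    """Least-squares line plus the residual standard deviation."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance."""
    m = len(estimates)
    within = statistics.fmean(variances)     # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    return statistics.fmean(estimates), within + (1 + 1 / m) * between


rng = random.Random(2025)
pairs = [(x, y) for x, y in zip(baseline, week4) if y is not None]
intercept, slope, resid_sd = fit_line(*zip(*pairs))

estimates, variances = [], []
for _ in range(5):  # five imputed datasets
    completed = [
        y if y is not None else intercept + slope * x + rng.gauss(0, resid_sd)
        for x, y in zip(baseline, week4)
    ]
    estimates.append(statistics.fmean(completed))  # analysis: mean week-4 VAS
    variances.append(statistics.variance(completed) / len(completed))

pooled_mean, total_variance = pool(estimates, variances)
print(f"pooled mean = {pooled_mean:.2f}, SE = {total_variance ** 0.5:.2f}")
```

The between-imputation term in `pool` is what distinguishes this method from single imputation: disagreement among the imputed datasets widens the final standard error.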

Maximum likelihood technique

This statistical method handles missing values by estimating parameters such as the mean, variance, and regression coefficients by maximizing the likelihood of the observed data. It does not fill in missing values with a single number. If the MAR assumption is valid, it offers unbiased estimates.[3-6]

Pattern-mixture model

This model does not assume that data are MAR, and it is a preferred approach when missingness depends on unobserved data (NMAR). For example, when UAS7 data are missing for a small number of patients, the patients are first classified based on their reporting patterns (single-day reporting, discontinuation after 2 days, or complete 7-day reporting). The software then provides separate outcomes for each pattern, which are combined to produce results with minimal bias.[4]

Sensitivity analysis

This is beneficial when there is uncertainty about the missingness mechanism.[6] For example, when assessing UAS7 scores, some patients with severe symptoms may not report them, suggesting NMAR missingness. The analysis is performed by imputing the missing values first with the best clinical outcome (urticaria-free) and then with the worst clinical outcome (severe symptoms with urticaria), and observing whether the treatment effect remains the same or changes.
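A best-case and worst-case analysis of this kind can be sketched as below. The outcomes are invented and coded as 1 for urticaria-free and 0 for severe symptoms; the two-arm layout and the function names are assumptions made for illustration.

```python
# Hypothetical outcomes: 1 = urticaria-free, 0 = severe symptoms, None = missing.
treatment = [1, 1, None, 1, 0, 1, None, 1]
control = [0, 1, 0, None, 0, 1, 0, 0]


def response_rate(outcomes, fill):
    """Proportion urticaria-free after replacing every missing outcome with `fill`."""
    filled = [fill if outcome is None else outcome for outcome in outcomes]
    return sum(filled) / len(filled)


def treatment_effect(fill):
    """Difference in response rate (treatment minus control) under one scenario."""
    return response_rate(treatment, fill) - response_rate(control, fill)


best_case = treatment_effect(1)   # all missing outcomes assumed urticaria-free
worst_case = treatment_effect(0)  # all missing outcomes assumed severe
print(f"best case: {best_case:.3f}, worst case: {worst_case:.3f}")
```

If the effect keeps the same direction and a similar size in both scenarios, the conclusion is robust to the missing data; a change of sign would call for caution.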

The selection of an appropriate method for handling missing data depends on the missingness mechanism and the proportion of missing observations. Common techniques for handling missing data are summarized in Figure 2.[3-6]

Figure 2: Missing data methods: A decision flowchart towards center.

Tips and warnings for handling missing data

  • The proportion of missing data must be reported before the key results

  • Whenever feasible, the underlying cause of missingness should be determined

  • Sensitivity analyses should be performed to verify the assumptions and interpretations

  • A complete case analysis should be performed and combined with a sensitivity analysis, especially when the MCAR assumption is not met

  • Single imputation methods should be avoided for primary outcome predictions; instead, multiple imputation or regression methods should be preferred as they are more accurate

  • The assumptions must be verified carefully before applying model-based statistical methods to handle missing data.[5,6]
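The first tip, reporting the proportion of missing data, can be sketched as follows using the records of Table 1. The column names and the layout of the report are illustrative assumptions.

```python
COLUMNS = ("age", "gender", "diabetes", "candidiasis")

# Rows of Table 1; None marks a missing value.
TABLE_1 = [
    (42, "Male", "Yes", "Yes"),
    (55, "Female", None, "No"),
    (38, "Male", "Yes", "No"),
    (46, "Male", "Yes", None),
    (60, "Female", "Yes", "Yes"),
    (72, "Male", None, "No"),
    (68, "Female", "Yes", "No"),
    (None, "Male", "No", "Yes"),
    (45, "Female", "Yes", "Yes"),
    (35, "Female", "No", "Yes"),
]

# Proportion of missing values for each variable.
missing_by_variable = {
    name: sum(row[i] is None for row in TABLE_1) / len(TABLE_1)
    for i, name in enumerate(COLUMNS)
}

# Proportion of records with at least one missing value.
incomplete_records = sum(
    any(value is None for value in row) for row in TABLE_1
) / len(TABLE_1)

for name, proportion in missing_by_variable.items():
    print(f"{name}: {proportion:.0%} missing")
print(f"records with any missing value: {incomplete_records:.0%}")
```

Reporting both figures is informative because a small percentage of missing cells per variable can still translate into a large share of incomplete records.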

CONCLUSION

Appropriate identification and management of missing data patterns are critical for ensuring the validity of medical research findings. Researchers must carefully select statistical methods that align with missingness patterns and routinely verify these assumptions through sensitivity analysis.

Ethical approval:

Institutional Review Board approval is not required.

Declaration of patient consent:

Patient’s consent is not required as there are no patients in this study.

Conflict of interest:

There are no conflicts of interest.

Use of artificial intelligence (AI)-assisted technology for manuscript preparation:

The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript, and no images were manipulated using AI.

Financial support and sponsorship: Nil.

References

  1. Clinical research and medical care: Towards effective and complete integration. BMC Med Res Methodol. 2015;15:4.
  2. Unlocking the power of health datasets and registries: The need for urgent institutional and national ownership and governance regulations for research advancement. J Nat Sci Med. 2023;6:159-65.
  3. Handling missing data in research. Perspect Clin Res. 2024;15:99-101.
  4. How can I deal with missing data in my study? Aust N Z J Public Health. 2001;25:464-9.
  5. Missing data. MIT Critical Data, editor. Secondary analysis of electronic health records (1st ed). Cham: Springer Nature. p. 143-58.
  6. Missing data in clinical studies. Int J Radiat Oncol Biol Phys. 2021;110:1267-71.