Missing data in medical research: A practical guide to identification and management

*Corresponding author: Mahalakshmy Thulasingam, Department of Preventive and Social Medicine, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India. mahalakshmi.dr@gmail.com
How to cite this article: Thulasingam M, Bhengra MS. Missing data in medical research: A practical guide to identification and management. CosmoDerma. 2025;5:140. doi: 10.25259/CSDM_197_2025
INTRODUCTION
Medical research is vital for disseminating evidence-based knowledge, enabling clinicians to make informed decisions and improve individual patient care.[1] Medical data include a wide range of information, such as a patient’s demographic details, medical history, investigations, diagnoses, treatments, and outcomes.[2] These datasets are highly useful for scientific progress. However, medical research frequently encounters missing data due to data entry errors, incomplete questionnaires, loss to follow-up, participant dropout, selective responses, and disruption in electronic health records. These missing values reduce the effective sample size and statistical power and can bias statistical analyses, affecting the accuracy and validity of the results. Identifying the missing data type and pattern helps researchers choose appropriate statistical methods for managing the data. This article aims to assist clinicians and early-career academics in effectively utilizing these techniques in medical research.
MISSING DATA IDENTIFICATION
During data collection, three primary categories of missing data are observed: “Missing Completely at Random (MCAR),” “Missing at Random (MAR),” and “Not Missing at Random (NMAR).” Researchers must first hypothesize the nature of the missingness in the dataset and apply the appropriate statistical methods, even though no formal test is available to accurately validate these assumptions.
Missing completely at random (MCAR) refers to missing data that occur in a purely random pattern. The likelihood of a missing value is the same for every record and is unrelated to any observed or unobserved variable in the dataset. However, this assumption is rarely satisfied.[3,4] The most common example is observer bias, when someone else performs clinical assessments in the absence of the primary observer. Here, missing entries occur randomly and are not related to the patient’s characteristics, disease severity, or treatment outcomes. Occasionally, transport barriers and adverse weather conditions may also prevent patients from attending follow-up visits.
Missing at random (MAR) refers to a situation in which known data can be used to predict missing data. Thus, the likelihood of a missing value is related to the patient characteristics measured in the clinical study. MAR is commonly observed in medical research studies.[3,4] For example, the Dermatology Life Quality Index is a patient-reported questionnaire, and patients with low literacy tend to leave more questions unanswered. Likewise, a few patients prescribed steroid medications may show poor adherence due to fear of side effects. Some female patients may also refuse a complete examination of skin lesions due to privacy concerns.
Not missing at random (NMAR) describes scenarios in which missing data explicitly depend on unobserved variables. The observed data alone cannot predict missingness.[3,4] For example, a few patients with diabetes and recurrent candidiasis may miss follow-up appointments because they have a negative attitude toward taking their medications as prescribed, which can result in high blood sugar levels. Furthermore, the Urticaria Activity Score (UAS7) is used to track disease activity or treatment response in patients with urticaria. However, the patient may skip recording daily logs due to intense itching and discomfort, causing irregularities in the clinical review.
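The three mechanisms can also be stated in the standard probabilistic notation. The symbols below are not used elsewhere in this article and are introduced only for illustration: R is the indicator of which values are missing, Y_obs is the observed part of the data, and Y_mis is the missing part.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{NMAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \ \text{depends on } Y_{\text{mis}} \ \text{even after conditioning on } Y_{\text{obs}}
\end{align*}
```

Because Y_mis is never seen, the observed data cannot distinguish MAR from NMAR, which is why the mechanism has to be hypothesized rather than tested.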
Another classification of missing data is recoverable and non-recoverable data. Recoverable data refer to cases where missingness occurs randomly and is typically due to unidentifiable reasons. In contrast, non-recoverable data occur when missingness is explained by available data or an unobserved value, resulting in more biased data.[5]
METHODS FOR HANDLING MISSING DATA
Complete case analysis
This straightforward approach eliminates all records with missing values and assumes that the remaining data represent the entire study population if missingness is MCAR. However, this method reduces statistical power and may introduce bias if the MCAR assumption is violated.[3-6] For example, in the scenario shown in Table 1, the highlighted cases (those with a blank cell) are discarded and not included in the analysis.
| Age | Gender | Diabetes | Candidiasis |
|---|---|---|---|
| 42 | Male | Yes | Yes |
| 55 | Female | ____ | No |
| 38 | Male | Yes | No |
| 46 | Male | Yes | ____ |
| 60 | Female | Yes | Yes |
| 72 | Male | ____ | No |
| 68 | Female | Yes | No |
| ____ | Male | No | Yes |
| 45 | Female | Yes | Yes |
| 35 | Female | No | Yes |
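As a minimal sketch of complete case analysis, the Python snippet below applies listwise deletion to the records of Table 1. The variable names and the use of `None` to mark a missing value are assumptions made for this illustration; in practice a statistical package would perform the same step.

```python
# Records transcribed from Table 1; None marks a missing value.
records = [
    {"age": 42, "gender": "Male", "diabetes": "Yes", "candidiasis": "Yes"},
    {"age": 55, "gender": "Female", "diabetes": None, "candidiasis": "No"},
    {"age": 38, "gender": "Male", "diabetes": "Yes", "candidiasis": "No"},
    {"age": 46, "gender": "Male", "diabetes": "Yes", "candidiasis": None},
    {"age": 60, "gender": "Female", "diabetes": "Yes", "candidiasis": "Yes"},
    {"age": 72, "gender": "Male", "diabetes": None, "candidiasis": "No"},
    {"age": 68, "gender": "Female", "diabetes": "Yes", "candidiasis": "No"},
    {"age": None, "gender": "Male", "diabetes": "No", "candidiasis": "Yes"},
    {"age": 45, "gender": "Female", "diabetes": "Yes", "candidiasis": "Yes"},
    {"age": 35, "gender": "Female", "diabetes": "No", "candidiasis": "Yes"},
]


def complete_cases(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(value is not None for value in row.values())]


kept = complete_cases(records)
print(f"{len(kept)} of {len(records)} records retained")
```

Four of the ten records contain a missing value, so only six remain for analysis, which illustrates the loss of statistical power described above.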
Available case analysis
This technique removes only observations with missing data for the specific variable being analyzed. Although it preserves more data than a complete case analysis, comparisons across analyses are difficult because of the varying sample sizes and compositions [Table 2].[3-6] Here, if individual variables are to be separately analyzed, then only missing observations (highlighted cells) are discarded.
| Age | Gender | Diabetes | Candidiasis |
|---|---|---|---|
| 42 | Male | Yes | Yes |
| 55 | Female | No | ____ |
| 38 | Male | Yes | No |
| 46 | Male | Yes | ____ |
| 60 | ____ | Yes | Yes |
| 72 | Male | ____ | No |
| 68 | Female | Yes | No |
| ____ | Male | No | Yes |
| 45 | Female | Yes | ____ |
| 35 | Female | No | Yes |
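The following sketch counts how many observations remain available for each variable in Table 2 when the variables are analyzed separately. The tuple layout and the use of `None` for every unrecorded cell are assumptions made for this illustration.

```python
# Rows transcribed from Table 2; None marks a cell with no recorded value.
columns = ("age", "gender", "diabetes", "candidiasis")
rows = [
    (42, "Male", "Yes", "Yes"),
    (55, "Female", "No", None),
    (38, "Male", "Yes", "No"),
    (46, "Male", "Yes", None),
    (60, None, "Yes", "Yes"),
    (72, "Male", None, "No"),
    (68, "Female", "Yes", "No"),
    (None, "Male", "No", "Yes"),
    (45, "Female", "Yes", None),
    (35, "Female", "No", "Yes"),
]


def available_cases(data, column_index):
    """Return the observed values of one variable, dropping only its own gaps."""
    return [row[column_index] for row in data if row[column_index] is not None]


# Sample size available for each single-variable analysis.
n_available = {name: len(available_cases(rows, i)) for i, name in enumerate(columns)}
print(n_available)
```

Each variable keeps a different subset of patients, which is why results from separate analyses are hard to compare.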
Mean/median substitution
This method substitutes all missing values with the mean or median of the available observations and is therefore easy and fast. If the dataset contains outliers, the median is the more robust choice. However, because every imputed value is identical, this method underestimates variability compared to the deletion methods.[3-6] For example, the age data from the tables above are 42, 55, 38, 46, 60, 72, 68, missing value, 45, and 35. The mean of the nine observed ages (461/9 ≈ 51 years) is imputed for the missing value.
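The arithmetic of this example can be checked with a short sketch that uses only Python's standard `statistics` module. Rounding the imputed mean to whole years is an assumption made to match the value quoted in the text.

```python
from statistics import mean, median, stdev

# Age column from the tables above; None marks the missing value.
ages = [42, 55, 38, 46, 60, 72, 68, None, 45, 35]

observed = [age for age in ages if age is not None]
mean_age = mean(observed)      # 461 / 9, about 51.2 years
median_age = median(observed)  # alternative when outliers are present

# Substitute the rounded mean for every missing age.
imputed = [age if age is not None else round(mean_age) for age in ages]

print(f"mean = {mean_age:.1f}, median = {median_age}")
print(f"SD of observed ages = {stdev(observed):.2f}")
print(f"SD after mean substitution = {stdev(imputed):.2f}")
```

The standard deviation after substitution is smaller than that of the observed ages, which is the underestimation of variability noted above.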
Linear interpolation
This is mostly applied to time-series data. For example, the Visual Analog Scale (VAS) is used to measure the intensity of symptoms such as itching and burning. If the VAS observation is missing at the 4th week, it is estimated by interpolating between the observations from the previous and subsequent weeks (i.e., the 3rd and 5th weeks).[5] Here, the red-marked point indicates the substituted value at week 4 [Figure 1].

Figure 1: Linear interpolation method for missing data.
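A minimal sketch of the week-4 estimate is given below. The weekly VAS scores are hypothetical values chosen for illustration and are not the data plotted in Figure 1.

```python
def interpolate_linear(t, t0, y0, t1, y1):
    """Estimate the value at time t on the straight line through (t0, y0) and (t1, y1)."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


# Hypothetical weekly VAS itch scores; the week-4 reading is missing.
vas = {1: 8.0, 2: 7.0, 3: 6.0, 4: None, 5: 4.0, 6: 3.0}

# Fill week 4 from its neighbours, weeks 3 and 5.
vas[4] = interpolate_linear(4, 3, vas[3], 5, vas[5])
print(vas[4])
```

With equally spaced visits, the interpolated value is simply the average of the two neighbouring scores.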
Hot deck/cold deck imputation
Hot deck imputation replaces missing values with data from similar records within the same dataset. In contrast, cold deck imputation replaces missing values with data from external sources or historical records. Both strategies tend to preserve authentic data distributions but depend heavily on the quality and relevance of the donor data.[4,5] For example, the patient’s missing age can be replaced either with the age of a similar case in the dataset (hot deck) or with the average age taken from the National Health Survey data (cold deck).
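Hot deck imputation can be sketched as follows, using the age and gender columns of Table 1. Defining a "similar" record as one with the same gender, and drawing the donor at random with a fixed seed, are assumptions made for this illustration; real applications usually match on several characteristics.

```python
import random

# (age, gender) pairs from Table 1; None marks the missing age.
records = [
    (42, "Male"), (55, "Female"), (38, "Male"), (46, "Male"), (60, "Female"),
    (72, "Male"), (68, "Female"), (None, "Male"), (45, "Female"), (35, "Female"),
]


def hot_deck_impute(rows, rng):
    """Replace each missing age with the age of a random donor of the same gender."""
    completed = []
    for age, gender in rows:
        if age is None:
            # Donor pool: records with the same gender and an observed age.
            donors = [a for a, g in rows if g == gender and a is not None]
            age = rng.choice(donors)  # raises IndexError if no donor exists
        completed.append((age, gender))
    return completed


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
completed = hot_deck_impute(records, rng)
imputed_age = completed[7][0]
print(imputed_age)
```

The imputed age is always a value that was actually observed in the dataset, which is how the method preserves the original data distribution.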
Last observation carried forward
This strategy addresses the absence of data in longitudinal investigations by carrying forward the most recent measurement to replace missing values. This is commonly used in dermatological clinical trials, where patients often miss follow-up visits due to severe symptoms or long treatment durations. However, disease status is assumed to remain stable from the last visit, which rarely occurs in practice, resulting in bias.[4,5]
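The carry-forward rule can be sketched in a few lines. The visit scores below are hypothetical, and the function name `locf` is introduced only for this illustration.

```python
def locf(values):
    """Fill each missing visit with the most recent observed measurement."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        # Visits before the first observation stay missing (None).
        filled.append(last_seen)
    return filled


# Hypothetical symptom scores at six consecutive visits; None marks a missed visit.
scores = [8, 7, None, None, 4, None]
print(locf(scores))  # [8, 7, 7, 7, 4, 4]
```

The filled series stays flat across every missed visit, which makes the assumption of a stable disease status explicit.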
Regression imputation
Regression models predict the missing data points based on the observed variables. This preserves the relationships between variables but may create artificial correlations, potentially biasing the model, particularly when missingness patterns vary across multiple variables.[3-6] For example, the missing VAS observation for itching at the 4th week of an 8-week study is predicted by a regression model fitted in software to observed values such as the baseline VAS score, age, and gender.
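A simplified sketch of regression imputation is shown below. It uses a single predictor (the baseline VAS score) rather than the several predictors mentioned above, and the patient values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (baseline VAS, week-4 VAS) pairs; None marks the missing week-4 score.
patients = [(4, 2.0), (6, 3.5), (8, 4.0), (10, 5.5), (5, 2.5), (7, None)]

# Fit the model on patients whose week-4 score was observed.
observed = [(base, week4) for base, week4 in patients if week4 is not None]
intercept, slope = fit_line([b for b, _ in observed], [w for _, w in observed])

# Replace each missing week-4 score with the model prediction.
completed = [
    (base, week4 if week4 is not None else intercept + slope * base)
    for base, week4 in patients
]
print(f"imputed week-4 VAS = {completed[-1][1]:.2f}")
```

Every imputed value lies exactly on the fitted line, which is why this approach can overstate the strength of the relationship between variables.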
Multiple imputation
Multiple imputation generates several datasets (typically five to ten) by replacing missing data with plausible values derived from predictive modeling. When VAS readings are missing, this method generates multiple plausible values based on each patient’s characteristics, performs the analysis on each completed dataset, and combines the results into a pooled estimate using software. Because the pooled estimate reflects the uncertainty of the imputed values, it is more robust than an estimate from a single imputation.[3-6]
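The pooling step is commonly carried out with Rubin's rules, which the sketch below implements. The imputation step itself is not shown; the five estimates and their variances are hypothetical values standing in for the results of analyzing five imputed datasets.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-dataset estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical treatment-effect estimates from five imputed datasets.
estimates = [2.1, 2.4, 1.9, 2.2, 2.4]
variances = [0.25, 0.30, 0.28, 0.26, 0.31]

pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.2f}, standard error = {sqrt(total):.2f}")
```

The between-imputation term widens the standard error to reflect how much the result changes from one imputed dataset to the next.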
Maximum likelihood technique
This statistical method handles missing values by estimating parameters such as mean, variance, and regression coefficients by maximizing the likelihood of the observed data. It does not fill in missing values with a single number. If the MAR assumption is valid, it offers unbiased estimates.[3-6]
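In general notation (the symbols are introduced here for illustration only), the quantity being maximized is the likelihood of the observed data, obtained by integrating the missing values out of the full-data density f with parameter vector θ:

```latex
L(\theta \mid Y_{\text{obs}}) = \int f(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta)\, dY_{\text{mis}}
```

Here Y_obs and Y_mis denote the observed and missing parts of the data. Under MAR, maximizing this function over θ does not require a separate model for the missingness mechanism.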
Pattern-mixture model
This model does not assume that data are MCAR or MAR, and it is the preferred approach when missingness depends on unobserved data. For example, when UAS7 data are missing for a small number of patients, they are first classified based on their reporting patterns (single-day reporting, discontinuation after 2 days, or complete 7-day reporting). The software then provides separate outcomes for each pattern, which are combined to produce results with minimal bias.[4]
Sensitivity analysis
This is beneficial when uncertainty is associated with missing data.[6] For example, when assessing UAS7 scores, some patients with severe symptoms may not report them, suggesting NMAR missingness. This analysis is performed by imputing missing values with the best clinical outcome (urticaria-free) and the worst clinical outcome (severe symptoms with urticaria), then observing whether the treatment effects remain the same or change.
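A best-case/worst-case sensitivity analysis can be sketched as follows. The UAS7 scores for the two arms are hypothetical, and the treatment effect is taken here as the simple difference in mean scores (control minus treatment), which is an assumption made for this illustration. UAS7 ranges from 0 (urticaria-free) to 42 (most severe).

```python
from statistics import mean

UAS7_BEST = 0    # urticaria-free
UAS7_WORST = 42  # maximum weekly score

# Hypothetical UAS7 scores; None marks a patient who did not report.
treatment = [6, 10, None, 8, None, 4]
control = [20, 24, 18, None, 22, 26]


def impute(scores, fill_value):
    """Replace every missing score with a fixed value."""
    return [fill_value if score is None else score for score in scores]


def effect(fill_value):
    """Difference in mean UAS7 (control minus treatment) under one scenario."""
    return mean(impute(control, fill_value)) - mean(impute(treatment, fill_value))


best_case = effect(UAS7_BEST)
worst_case = effect(UAS7_WORST)
print(f"best-case effect = {best_case:.1f}, worst-case effect = {worst_case:.1f}")
```

In this hypothetical example the treatment arm scores lower under both scenarios, although the size of the difference changes; a conclusion that reversed between scenarios would indicate that the result is sensitive to the missing data.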
The selection of an appropriate method for handling missing data depends on the missingness mechanism and the proportion of missing observations. Common techniques for handling missing data are summarized in Figure 2.[3-6]

Figure 2: Missing data methods: A decision flowchart.
Tips and warnings for handling missing data
The proportion of missing data must be reported before the key results
Whenever feasible, the underlying cause of missingness should be determined
Sensitivity analyses should be performed to verify the assumptions and interpretations
A complete case analysis should be combined with a sensitivity analysis, especially when the MCAR assumption is not met
Single imputation methods should be avoided for primary outcome predictions; instead, multiple imputation or regression methods should be preferred as they are more accurate
The assumptions must be verified carefully before applying model-based statistical methods to handle missing data.[5,6]
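As an illustration of the first tip, the sketch below reports the proportion of missing values for each variable in Table 1. The tuple layout and the use of `None` for missing cells are assumptions made for this example.

```python
# Rows transcribed from Table 1; None marks a missing value.
columns = ("age", "gender", "diabetes", "candidiasis")
rows = [
    (42, "Male", "Yes", "Yes"),
    (55, "Female", None, "No"),
    (38, "Male", "Yes", "No"),
    (46, "Male", "Yes", None),
    (60, "Female", "Yes", "Yes"),
    (72, "Male", None, "No"),
    (68, "Female", "Yes", "No"),
    (None, "Male", "No", "Yes"),
    (45, "Female", "Yes", "Yes"),
    (35, "Female", "No", "Yes"),
]


def missing_proportions(data, names):
    """Return the fraction of missing values for each variable."""
    n = len(data)
    return {
        name: sum(row[i] is None for row in data) / n
        for i, name in enumerate(names)
    }


report = missing_proportions(rows, columns)
for name, proportion in report.items():
    print(f"{name}: {proportion:.0%} missing")
```

A summary of this kind can be placed in the results section ahead of the main findings.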
CONCLUSION
Appropriate identification and management of missing data patterns are critical for ensuring the validity of medical research findings. Researchers must carefully select statistical methods that align with missingness patterns and routinely verify these assumptions through sensitivity analysis.
Ethical approval:
Institutional Review Board approval is not required.
Declaration of patient consent:
Patient’s consent not required as there are no patients in this study.
Conflict of interest:
There are no conflicts of interest.
Use of artificial intelligence (AI)-assisted technology for manuscript preparation:
The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript, and no images were manipulated using AI.
Financial support and sponsorship: Nil.
References
1. Clinical research and medical care: Towards effective and complete integration. BMC Med Res Methodol. 2015;15:4.
2. Unlocking the power of health datasets and registries: The need for urgent institutional and national ownership and governance regulations for research advancement. J Nat Sci Med. 2023;6:159-65.
3. Handling missing data in research. Perspect Clin Res. 2024;15:99-101.
4. How can I deal with missing data in my study? Aust N Z J Public Health. 2001;25:464-9.
5. Missing data. MIT Critical Data, editor. Secondary analysis of electronic health records (1st ed). Cham: Springer Nature; 2016. p. 143-58.
6. Missing data in clinical studies. Int J Radiat Oncol Biol Phys. 2021;110:1267-71.