Handling Missing Data in Research Studies

Data are essential to research, but any experienced researcher knows that it’s nearly impossible to collect data without holes, biases, or flaws. Dr. Joseph Olsen, associate dean of the College of Family, Home and Social Sciences, addressed a recent Educational Inquiry, Measurement, and Evaluation (EIME) seminar with advice for “Handling Missing Data in Research Studies.”

A focus of Olsen’s presentation was missing data mechanisms. He showed attendees how to distinguish between data missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). He noted that data seldom fall into only one of these categories. “If we know or can convincingly argue that the data are not MNAR, then there is a test that can be done to find out if we are missing data components,” Olsen informed McKay School faculty and students. He explained the importance of identifying the categories of missing data: MCAR data are ignorable, MAR data are conditionally ignorable, but MNAR data are non-ignorable.
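
The practical difference between the three mechanisms can be seen in a small simulation. The sketch below is only an illustration, not material from Olsen’s presentation: the function name, sample size, and drop probabilities are assumptions chosen for the example. A variable x is always observed, while y (true mean near zero) goes missing under each mechanism. For each case the code reports the naive mean of the observed y values and a mean adjusted by conditioning on whether x is positive.

```python
import random
import statistics


def complete_case_means(n=20000, seed=0):
    """Return {mechanism: (naive_mean, adjusted_mean)} for simulated data.

    All names and probabilities here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # always observed
    y = [xi + rng.gauss(0, 1) for xi in x]    # may go missing; true mean is 0

    def observed(drop_prob):
        # Keep the (x, y) pairs whose y value survives the missingness process.
        return [(xi, yi) for xi, yi in zip(x, y)
                if rng.random() >= drop_prob(xi, yi)]

    samples = {
        # MCAR: every y has the same chance of being lost.
        "MCAR": observed(lambda xi, yi: 0.3),
        # MAR: the chance of loss depends only on the observed x.
        "MAR": observed(lambda xi, yi: 0.6 if xi > 0 else 0.1),
        # MNAR: the chance of loss depends on the unseen y itself.
        "MNAR": observed(lambda xi, yi: 0.6 if yi > 0 else 0.1),
    }

    share_pos = sum(xi > 0 for xi in x) / n
    results = {}
    for name, pairs in samples.items():
        naive = statistics.mean(yi for _, yi in pairs)
        # Condition on the observed x: average y within each x group,
        # then weight the groups by their share of the full sample.
        pos = statistics.mean(yi for xi, yi in pairs if xi > 0)
        neg = statistics.mean(yi for xi, yi in pairs if xi <= 0)
        results[name] = (naive, share_pos * pos + (1 - share_pos) * neg)
    return results


for mechanism, (naive, adjusted) in complete_case_means().items():
    print(f"{mechanism}: naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this setup the naive mean stays close to zero only under MCAR. Under MAR the naive mean is biased, but conditioning on x brings the estimate back, which is the sense in which MAR data are conditionally ignorable. Under MNAR neither estimate recovers the true mean.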

Olsen also discussed the various patterns of missing data in research studies. Missing data by design means that the data were excluded by structure. For example, a researcher may vertically scale tests and not present all the test questions to every grade level. Other patterns of missing data he mentioned are univariate, monotone, file matching, and general. The univariate missing data pattern is more tractable. Monotone patterns occur when some of the study participants drop out before completing the study. File matching patterns are missing half of the data. The general pattern, which involves data missing over a wide area, is the most common pattern of missing data.

After a researcher has determined that data are missing and which mechanism and pattern are involved, the most important responses, according to Dr. Olsen, are to preserve the essential characteristics of the data, maintain the representativeness of the analyzed data, and provide valid statistical inference. He discussed the problems of older missing data treatments, including deletion methods, cold deck imputation, hot deck imputation, mean substitution, regression imputation, special methods for longitudinal studies, and special methods for multi-item scales. He criticized these common treatments, saying that many of them shrink the variance, the covariance, and the correlation, as well as the variance around the regression line.
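
The shrinkage Olsen described is easy to reproduce for one of the older treatments, mean substitution. The following is a minimal sketch under assumed conditions (simulated data, a 40 percent MCAR missing rate, and hypothetical function names), not an example from the seminar. It fills each missing y with the mean of the observed y values and compares the variance of y and its correlation with x before and after.

```python
import random
import statistics


def pearson(a, b):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5


def mean_substitution_demo(n=5000, missing_rate=0.4, seed=1):
    """Compare variance and correlation before and after mean substitution.

    Sample size, missing rate, and data model are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    # MCAR: each y is flagged missing with the same probability.
    missing = [rng.random() < missing_rate for _ in range(n)]

    observed_mean = statistics.mean(yi for yi, m in zip(y, missing) if not m)
    # Mean substitution: every missing y gets the same observed mean.
    y_filled = [observed_mean if m else yi for yi, m in zip(y, missing)]

    return {
        "var_full": statistics.variance(y),
        "var_filled": statistics.variance(y_filled),
        "corr_full": pearson(x, y),
        "corr_filled": pearson(x, y_filled),
    }


for key, value in mean_substitution_demo().items():
    print(f"{key}: {value:.3f}")
```

Because every filled-in value sits exactly at the mean, the filled variable has less spread than the real one and a weaker relationship with x, even though the data here are MCAR.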

Olsen introduced newer missing data treatments, including the modern, state-of-the-art treatments for MAR data: maximum likelihood and multiple imputation. “I’m fully behind max likelihood and multiple imputation,” Olsen told faculty and students. Maximum likelihood estimates summary statistics or statistical models using all available data. Multiple imputation imputes the individual missing values in multiple completed data sets, then averages the results of statistical analysis across the data sets.
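
The multiple imputation workflow can be sketched in a few lines. The code below is a simplified illustration, not the procedure Olsen presented or the algorithm of any particular statistics package. It assumes a single always-observed predictor x, marks missing y values with None, uses a bootstrap refit of a straight-line regression to vary the imputation model between data sets, and pools the estimated mean of y with Rubin’s rules. The function names and the default of 20 imputations are assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma


def multiply_impute_mean(x, y, m=20, seed=0):
    """Pooled estimate of mean(y) and its total variance; None marks missing y."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    estimates, within = [], []
    for _ in range(m):
        # Refit on a bootstrap resample so each data set uses a slightly
        # different imputation model (an assumed stand-in for a posterior draw).
        boot = [rng.choice(obs) for _ in obs]
        a, b, s = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Impute each missing y as prediction plus random residual noise.
        completed = [yi if yi is not None else a + b * xi + rng.gauss(0, s)
                     for xi, yi in zip(x, y)]
        # Analyze the completed data set: here, the mean and its variance.
        estimates.append(statistics.mean(completed))
        within.append(statistics.variance(completed) / len(completed))
    pooled = statistics.mean(estimates)        # average across data sets
    w = statistics.mean(within)                # within-imputation variance
    b_var = statistics.variance(estimates)     # between-imputation variance
    return pooled, w + (1 + 1 / m) * b_var     # Rubin's total variance
```

Calling multiply_impute_mean(x, y) on two equal-length lists returns the pooled mean and its total variance. Adding random noise to each imputed value, rather than filling in the bare prediction, is what keeps the variance from shrinking the way it does under mean substitution or plain regression imputation.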

Despite the limitations and flaws of older missing data treatments, Olsen believes many social scientists continue to use the techniques because they are unaware of or unfamiliar with newer treatments or unconvinced of the problems with older methods. Olsen added that the literature on missing data treatments is technically daunting, and journal reviewers and editors do not require the use of newer treatments.

21 December 2009