Missing Data

My postdoctoral work involved developing an algorithm for imputing missing data in a large agricultural survey, the USDA’s Agricultural Resource Management Survey (ARMS). This work presented unique challenges due to the size and distributional structure of the dataset and yielded several publications (Robbins & White, 2011; Robbins et al., 2013; Robbins & White, 2014; Robbins, 2014). The resulting algorithm introduced several novel features that enable theoretically valid and computationally efficient imputation of complex data: copula modeling via transformations based on empirical distributions, use of the SWEEP operator to improve efficiency, and construction of a joint model via a sequence of selected conditional models.
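As a rough illustration of the empirical-distribution transformation idea (a minimal sketch, not the published implementation), the Python code below maps a skewed variable to standard-normal scores through its empirical CDF and maps a score back to the data scale through the empirical quantile function. The function names and the rank / (n + 1) convention are assumptions made for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def to_normal_scores(values):
    """Map observed values to standard-normal scores via their empirical CDF."""
    n = len(values)
    ordered = sorted(values)
    scores = []
    for v in values:
        # Rank = number of observations <= v. Dividing by n + 1 keeps the
        # probability strictly inside (0, 1), so inv_cdf stays finite.
        rank = sum(1 for u in ordered if u <= v)
        scores.append(_STD_NORMAL.inv_cdf(rank / (n + 1)))
    return scores


def from_normal_score(z, observed):
    """Map a normal score back to the data scale via the empirical quantile."""
    ordered = sorted(observed)
    n = len(ordered)
    p = _STD_NORMAL.cdf(z)
    # Invert rank / (n + 1): choose the order statistic nearest to p * (n + 1).
    idx = min(max(round(p * (n + 1)) - 1, 0), n - 1)
    return ordered[idx]


# Hypothetical skewed data, e.g. a payment variable with a long right tail.
data = [0.2, 1.5, 3.1, 12.0, 250.0]
scores = to_normal_scores(data)
back = [from_normal_score(z, data) for z in scores]
```

In a sketch like this, a model would be fit and imputations drawn on the normal-score scale, and each drawn score would then be passed through `from_normal_score` so that imputed values respect the observed marginal distribution.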

Motivated by specific issues encountered while performing imputation in a large Department of Defense survey at RAND, I later generalized this procedure to produce the GERBIL algorithm (Robbins, 2024), which is available in the R package gerbil (Robbins et al., 2023). By using a latent multivariate Gaussian model with probit-type assumptions for non-continuous variables, this method can create imputations for data of a general form (with continuous, binary, unordered categorical, and ordinal variables). It applies joint modeling in a highly computationally efficient manner and allows flexibility in constructing the imputation model. It is shown to outperform other state-of-the-art procedures in terms of both imputation quality and computational burden.
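To make the probit-type latent-variable idea concrete, here is a minimal Python sketch for a single binary variable. It is an illustration under simplifying assumptions (unit latent variance, a known conditional mean, threshold at zero) and is not the gerbil package's code or API; `draw_latent` and `impute_binary` are hypothetical names.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def draw_latent(mean, observed_y, rng):
    """Draw a unit-variance latent normal consistent with a binary value.

    observed_y is 1, 0, or None (missing). An observed 1 restricts the draw
    to the positive half-line, an observed 0 to the non-positive half-line,
    and a missing value leaves the draw unrestricted.
    """
    # Probability that the latent variable falls at or below zero.
    p_zero = _STD_NORMAL.cdf(-mean)
    if observed_y is None:
        lo, hi = 0.0, 1.0
    elif observed_y == 1:
        lo, hi = p_zero, 1.0
    else:
        lo, hi = 0.0, p_zero
    # Inverse-CDF sampling from the (possibly truncated) normal.
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf finite
    return mean + _STD_NORMAL.inv_cdf(u)


def impute_binary(mean, rng):
    """Impute a missing binary value by thresholding an unrestricted draw."""
    return 1 if draw_latent(mean, None, rng) > 0 else 0


rng = random.Random(0)
latent_for_observed_one = draw_latent(0.3, 1, rng)   # latent value for y = 1
imputed_value = impute_binary(0.3, rng)              # 0 or 1 for a missing y
```

Under these assumptions, every variable lives on a common Gaussian scale: observed binary values only constrain the sign of their latent counterparts, and missing values are imputed by drawing the latent variable and thresholding it.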

Variance estimation in the presence of imputed data typically relies on algebraic expressions and the validity of multiple imputation combining rules. To improve the utility of imputed data in a broader array of settings, I recently developed the theory that underpins the use of resampling procedures such as a bootstrap or jackknife with imputed data (Robbins & Burgette, 2025). This work illustrates the substantial computational burden that resampling procedures impose when used with imputed data, which underscores the value of efficient algorithms such as gerbil.
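The nesting of imputation inside resampling can be sketched as follows. This is a simplified Python illustration, not the procedure from the paper: the imputation model is a deliberately crude stand-in (normal draws fitted to the observed values), the estimand is a sample mean, and all function names are hypothetical. It shows only the structural point that imputations are generated independently, several times, within each bootstrap replicate.

```python
import random
from statistics import mean, stdev, variance


def impute_once(sample, rng):
    """Fill each missing value (None) with a normal draw fitted to observed data.

    A crude stand-in for a real imputation model; it needs at least two
    observed values in the sample.
    """
    observed = [y for y in sample if y is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [y if y is not None else rng.gauss(mu, sigma) for y in sample]


def bootstrap_variance(data, n_boot, n_imp, rng):
    """Bootstrap variance of the mean, re-imputing inside every replicate."""
    replicate_estimates = []
    for _ in range(n_boot):
        # Resample the incomplete data, missing values included.
        replicate = [rng.choice(data) for _ in data]
        # Impute independently n_imp times within this replicate and average.
        estimates = [mean(impute_once(replicate, rng)) for _ in range(n_imp)]
        replicate_estimates.append(mean(estimates))
    return variance(replicate_estimates)


# Hypothetical data: 60 values, every fourth one missing.
rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(60)]
data = [None if i % 4 == 0 else y for i, y in enumerate(data)]
var_hat = bootstrap_variance(data, n_boot=200, n_imp=5, rng=rng)
```

The imputation routine runs `n_boot * n_imp` times here (1,000 in this toy case), which is why the cost of a single imputation run matters so much once resampling is involved.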

Collaborators:

Lane Burgette, RAND Corporation
Sujit Ghosh, NC State University
Joshua Habiger, Oklahoma State University
Kirk White, Census Bureau

References

Journal Articles

Farm commodity payments and imputation in the Agricultural Resource Management Survey

M. W. Robbins and T. K. White

American Journal of Agricultural Economics, 2011

Imputation in high dimensional economic data as applied to the Agricultural Resource Management Survey

M. W. Robbins, S. K. Ghosh, and J. D. Habiger

Journal of the American Statistical Association, 2013

In this article, we consider imputation in the USDA’s Agricultural Resource Management Survey (ARMS) data, which is a complex, high-dimensional economic dataset. We develop a robust joint model for ARMS data, which requires that variables are transformed using a suitable class of marginal densities (e.g., skew normal family). We assume that the transformed variables may be linked through a Gaussian copula, which enables construction of the joint model via a sequence of conditional linear models. We also discuss the criteria used to select the predictors for each conditional model. For the purpose of developing an imputation method that is conducive to these model assumptions, we propose a regression-based technique that allows for flexibility in the selection of conditional models while providing a valid joint distribution. In this procedure, labeled as iterative sequential regression (ISR), parameter estimates and imputations are obtained using a Markov chain Monte Carlo sampling method. Finally, we apply the proposed method to the full ARMS data, and we present a thorough data analysis that serves to gauge the appropriateness of the resulting imputations. Our results demonstrate the effectiveness of the proposed algorithm and illustrate the specific deficiencies of existing methods. Supplementary materials for this article are available online.
Direct payments, cash rents, land values, and the effects of imputation in U.S. farm-level data

M. W. Robbins and T. K. White

Agricultural and Resource Economics Review, 2014

Research using the Agricultural Resource Management Survey (ARMS) and other data shows that direct government payments to farmers increase rents and the price of land. However, some ARMS data is imputed and does not account for relationships between payments and other variables. We investigate various imputation methods and benefits gained from a method with a wide scope rather than a parsimonious range of variables. Using our method, we estimate that an additional dollar of direct payment increases land value about $2.69 more per acre than ARMS imputation methods and that our imputations (using an exhaustive iterative sequential regression) outperform other methods and/or smaller models.
The utility of nonparametric transformations for imputation of survey data

M. W. Robbins

Journal of Official Statistics, 2014

Missing values present a prevalent problem in the analysis of establishment survey data. Multivariate imputation algorithms (which are used to fill in missing observations) tend to have the common limitation that imputations for continuous variables are sampled from Gaussian distributions. This limitation is addressed here through the use of robust marginal transformations. Specifically, kernel-density and empirical distribution-type transformations are discussed and are shown to have favorable properties when used for imputation of complex survey data. Although such techniques have wide applicability (i.e., they may be easily applied in conjunction with a wide array of imputation techniques), the proposed methodology is applied here with an algorithm for imputation in the USDA’s Agricultural Resource Management Survey. Data analysis and simulation results are used to illustrate the specific advantages of the robust methods when compared to the fully parametric techniques and to other relevant techniques such as predictive mean matching. To summarize, transformations based upon parametric densities are shown to distort several data characteristics in circumstances where the parametric model is ill fit; however, no circumstances are found in which the transformations based upon parametric models outperform the nonparametric transformations. As a result, the transformation based upon the empirical distribution (which is the most computationally efficient) is recommended over the other transformation procedures in practice.
Joint imputation of general data

M. W. Robbins

Journal of Survey Statistics and Methodology, 2024

High-dimensional complex survey data of general structures (e.g., containing continuous, binary, categorical, and ordinal variables), such as the US Department of Defense’s Health-Related Behaviors Survey (HRBS), often confound procedures designed to impute any missing survey data. Imputation by fully conditional specification (FCS) is often considered the state of the art for such datasets due to its generality and flexibility. However, FCS procedures contain a theoretical flaw that is exposed by HRBS data—HRBS imputations created with FCS are shown to diverge across iterations of Markov Chain Monte Carlo. Imputation by joint modeling lacks this flaw; however, current joint modeling procedures are neither general nor flexible enough to handle HRBS data. As such, we introduce an algorithm that efficiently and flexibly applies multiple imputation by joint modeling in data of general structures. This procedure draws imputations from a latent joint multivariate normal model that underpins the generally structured data and models the latent data via a sequence of conditional linear models, the predictors of which can be specified by the user. We perform rigorous evaluations of HRBS imputations created with the new algorithm and show that they are convergent and of high quality. Lastly, simulations verify that the proposed method performs well compared to existing algorithms including FCS.
Resampling methods with multiply imputed data

M. W. Robbins and L. Burgette

Biometrika, 2025

Resampling techniques have become increasingly popular for estimation of uncertainty. However, data are often fraught with missing values that are commonly imputed to facilitate analysis. This article addresses the issue of using resampling methods such as a jackknife or bootstrap in conjunction with imputations that have been sampled stochastically, in the vein of multiple imputation. We derive the theory needed to illustrate two key points regarding the use of resampling methods in lieu of traditional combining rules. First, imputations should be independently generated multiple times within each replicate group of a jackknife or bootstrap. Second, the number of multiply imputed datasets per replicate group must dramatically exceed the number of replicate groups for a jackknife; however, this is not the case in a bootstrap approach. We also discuss bias-adjusted analogues of the jackknife and bootstrap that are argued to require fewer imputed datasets. A simulation study is provided to support these theoretical conclusions.

Manuals

gerbil: Generalized Efficient Regression-Based Imputation with Latent Processes

M. Robbins, P. Lima, and M. Griswold

2023

R package version 0.1.9
