Missing entries in multivariate data distort not only marginal summaries but also the covariance geometry that governs scale-adjusted and correlation-aware comparisons between observations. Motivated by covariance-sensitive downstream tasks, this paper develops a deterministic imputation framework driven by Mahalanobis distance. The first stage is a linear frozen-covariance procedure: missing entries are temporarily replaced by simple columnwise values, a fixed covariance matrix is computed, and the sum of the nonconstant squared Mahalanobis distances is minimized with respect to the unknown entries. Since the inverse covariance is fixed at that stage, the objective is quadratic and the first-order optimality conditions reduce to a linear system. The second stage is a nonlinear covariance-updating refinement in which the covariance matrix depends on the imputed values themselves and the optimization is performed locally, using the linear solution as initializer. We derive a compact matrix representation of the linear objective, give a sufficient full-rank condition guaranteeing uniqueness of the stationarity system, discuss the bias induced by freezing the covariance, and provide a regularized fallback for singular or ill-conditioned systems. The framework also clarifies its scope with respect to MCAR, MAR-type, and structured block masks, and uses covariance stabilization only as a numerical safeguard rather than as a determinant-minimization estimator. A repeated-mask experiment on the red wine quality dataset shows that the Mahalanobis method substantially improves on mean imputation at all masking levels and becomes the strongest among the tested methods at the highest missingness level considered. The resulting method is transparent, reproducible, and intended for moderate continuous-data settings in which preserving empirical covariance geometry is more important than fitting a large black-box model.
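As an illustration of the first stage described above, the following is a minimal sketch and not the authors' implementation. It assumes that both the column mean and the inverse covariance are frozen after the initial columnwise mean fill, so the quadratic objective decouples row by row and each row's stationarity condition becomes a small linear system. The function name `frozen_mahalanobis_impute` and the `ridge` parameter (a small diagonal shift used only as a numerical safeguard) are hypothetical names introduced for this example.

```python
import numpy as np


def frozen_mahalanobis_impute(X, ridge=1e-8):
    """Sketch of a frozen-covariance Mahalanobis imputation step.

    Assumptions (not taken from the paper): missing entries are NaN,
    the temporary fill is the column mean, and the mean vector and
    inverse covariance stay fixed while the missing entries are solved for.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                      # True where an entry is missing
    col_means = np.nanmean(X, axis=0)       # simple columnwise values
    X0 = np.where(mask, col_means, X)       # temporary mean-filled matrix

    mu = X0.mean(axis=0)                    # frozen center
    sigma = np.atleast_2d(np.cov(X0, rowvar=False))
    p = X.shape[1]
    # Hypothetical safeguard: shift the diagonal slightly before inverting.
    P = np.linalg.inv(sigma + ridge * np.eye(p))

    out = X0.copy()
    # Only rows with missing entries have a nonconstant squared distance.
    for i in np.flatnonzero(mask.any(axis=1)):
        m = mask[i]                         # missing coordinates of row i
        o = ~m                              # observed coordinates of row i
        # Stationarity: P_mm (x_m - mu_m) = -P_mo (x_o - mu_o)
        rhs = -P[np.ix_(m, o)] @ (X0[i, o] - mu[o])
        out[i, m] = mu[m] + np.linalg.solve(P[np.ix_(m, m)], rhs)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(20, 3))
    demo[:, 1] += 0.8 * demo[:, 0]          # introduce correlation
    demo[2, 1] = np.nan
    demo[5, [0, 2]] = np.nan
    print(frozen_mahalanobis_impute(demo)[[2, 5]])
```

Under these assumptions each imputed block coincides with the Gaussian conditional-mean formula evaluated at the frozen mean and covariance. A covariance-updating refinement of the kind described for the second stage would instead recompute the mean and covariance from the imputed matrix inside the objective, with this output serving as the starting point.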
Citation: Alvaro H. Salas S., David L. Ocampo R., Lorenzo J. Martínez H., Mahalanobis-geometry imputation for multivariate data with missing entries, AIMS Mathematics, 2026, 11(5): 14641-14654. doi: 10.3934/math.2026600