Missing data and cluster graphs: cluster-level missingness vs variable-level missingness

Abstract

Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution as well as for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference and when finer-grained modeling is necessary.

Publication
In arXiv
Willow Scott
Master's intern

Willow is an intern working under the supervision of Charles Assaad and Eugenio Valdano.

Eugenio Valdano
Principal Investigator