Event

Statistical Significance of Clustering for High-Dimensional Complex Data

Yufeng Liu
John D. MacArthur Professor of Statistics
University of Michigan

 

Date: 18 June 2026, Thursday

Time: 3 pm (Singapore time)

Venue: S16-06-118, Seminar Room

 

Clustering is widely used in biomedical research to identify meaningful subgroups. However, most existing clustering algorithms do not account for the statistical uncertainty inherent in the resulting clusters, which can lead to spurious findings arising from natural sampling variation. To address this issue, the Statistical Significance of Clustering (SigClust) method was developed to formally assess the significance of clusters in high-dimensional data. In this talk, we begin by defining a cluster as a group of observations generated from a single Gaussian distribution and formulate the evaluation of clustering significance as a hypothesis testing problem. We then discuss challenges related to high-dimensional covariance estimation in SigClust and introduce an enhanced version that incorporates multidimensional scaling (MDS) based on dissimilarity matrices to address these challenges. To extend the methodology beyond continuous data, we further propose SigClust-DEV, a recent approach designed to assess the significance of clustering for count data. We will also discuss more recent nonparametric extensions based on generative models. Finally, we illustrate the application of SigClust to single-cell RNA sequencing (scRNA-seq) data and electronic health records (EHRs), demonstrating its utility in real-world biomedical applications.
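
To make the hypothesis-testing formulation concrete, here is a minimal sketch of a SigClust-style Monte Carlo test. It is an illustration under simplifying assumptions, not the speaker's implementation. The null hypothesis is that the data come from a single Gaussian distribution. The test statistic is the 2-means cluster index (within-cluster sum of squares divided by total sum of squares), and the null distribution is simulated from a Gaussian whose covariance eigenvalues are taken directly from the sample covariance matrix. That last choice is the main simplification: it sidesteps the high-dimensional covariance estimation problem mentioned in the abstract and is only reasonable when the dimension is small relative to the sample size. The function names `two_means_cluster_index` and `sigclust_pvalue` are hypothetical and chosen for this example.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best_within = np.inf
    for _init in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _step in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster is empty; abandon this start
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Sum of squared distances from each point to its nearest center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best_within = min(best_within, dist.min(axis=1).sum())
    return best_within / total_ss


def sigclust_pvalue(X, n_sim=200, seed=0):
    """Monte Carlo p-value for H0: the rows of X come from a single Gaussian."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = two_means_cluster_index(X, rng)

    # The cluster index is invariant to shifts and rotations, so under H0 only
    # the covariance eigenvalues matter. ASSUMPTION: plain sample eigenvalues
    # are used here, which is only sensible when p is small relative to n.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)

    null_index = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, p)) * np.sqrt(eigvals)
        null_index[b] = two_means_cluster_index(Z, rng)

    # Small observed index relative to the null suggests real cluster structure.
    p_value = (1 + (null_index <= observed).sum()) / (n_sim + 1)
    return observed, p_value


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    # Synthetic data: two well-separated Gaussian groups in 5 dimensions.
    X_demo = np.vstack([
        demo_rng.standard_normal((30, 5)) + 5.0,
        demo_rng.standard_normal((30, 5)) - 5.0,
    ])
    index, p_value = sigclust_pvalue(X_demo)
    print(f"cluster index = {index:.3f}, Monte Carlo p-value = {p_value:.4f}")
```

With `n_sim` simulated data sets, the smallest attainable p-value is 1 / (`n_sim` + 1), so a larger `n_sim` is needed to report very small p-values.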