A supervised weeding method to cluster high dimensional predictors with application to job market analysis

Yuyang Li; Jianxin Bi; Jingyuan Liu; Ying Yang

doi:10.1080/02664763.2024.2348634

J Appl Stat. 2024 May 1;51(16):3350-3365. doi: 10.1080/02664763.2024.2348634. eCollection 2024.

Authors

Yuyang Li¹, Jianxin Bi², Jingyuan Liu², Ying Yang³

Affiliations

¹ Department of Statistics, Iowa State University, Ames, IA, USA.
² MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Xiamen University Laboratory of Digital Finance, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, People's Republic of China.
³ Department of Data Science, Tencent Cloud Technology Co., Wuhan, People's Republic of China.

PMID: 39628857
PMCID: PMC11610315 (available on 2025-05-01)
DOI: 10.1080/02664763.2024.2348634

Abstract

The clustering of high-dimensional predictors is drawing increasing attention in various scientific areas, such as text mining and biological data analysis. Standard clustering procedures, when applied to predictors, reveal only the inherent patterns within the predictor set and lack the capacity to predict the response variable. To address this limitation, a new supervised weeding algorithm is proposed to meet the dual requirement of detecting sparse clusters and capturing prediction effects. The algorithm is based on an iterative feature screening and coherence evaluation procedure: it iteratively weeds out unimportant predictors in a backward fashion, forming sequences of nested sets from which data-driven optimal cut-offs are determined. Monte Carlo simulations are used to assess the finite-sample performance of the proposed method. The results show that both its clustering and its prediction performance are comparable to those of existing methods that concentrate solely on one of the two targets. A job description dataset is analyzed to identify significant groups of keywords that affect employees' salaries.
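
The abstract describes the weeding procedure only in outline, so the following is a minimal illustrative sketch of a backward weeding loop, not the authors' algorithm. Every concrete choice in it is an assumption: the screening utility (absolute marginal correlation with the response), the coherence measure (mean absolute pairwise correlation among the remaining predictors), the rule for removing one predictor per step, the cut-off rule (largest gain in coherence), and the hypothetical names `weed_backward` and `pick_cutoff`.

```python
import numpy as np

def weed_backward(X, y):
    """Hypothetical backward weeding; returns nested sets with their coherence.

    Assumes no predictor column is constant (standardization would divide by 0).
    """
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    # Assumed screening utility: absolute marginal correlation with y.
    utility = np.abs(Xs.T @ ys) / n
    # Absolute pairwise correlations among predictors (diagonal equals 1).
    corr = np.abs(Xs.T @ Xs) / n
    active = list(range(p))
    path = []
    while len(active) >= 2:
        sub = corr[np.ix_(active, active)]
        # Assumed coherence: mean absolute correlation with the other members.
        member_coh = (sub.sum(axis=1) - 1.0) / (len(active) - 1)
        path.append((list(active), float(member_coh.mean())))
        # Weed the predictor that is weakest on both criteria combined.
        score = utility[active] * member_coh
        active.pop(int(np.argmin(score)))
    return path

def pick_cutoff(path):
    """Assumed cut-off rule: the set right after the largest coherence gain.

    Requires at least two entries in path, i.e. at least three predictors.
    """
    coh = np.array([c for _, c in path])
    best = int(np.argmax(np.diff(coh))) + 1
    return path[best][0]
```

Calling `pick_cutoff(weed_backward(X, y))` on an `n × p` predictor matrix `X` and a response vector `y` returns the column indices of one candidate cluster. Removing a single predictor per step keeps the sets nested by construction, which mirrors the "sequences of nested sets" mentioned above. The published method may screen, score, and choose cut-offs quite differently.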

Keywords: Data analysis; supervised clustering; big data; clustering predictors; feature screening; high dimensional data; high dimensional data analysis; text mining.

Grants and funding

Bi and Liu's research was supported by NNSFC 12271456 and 71988101 and by the Ministry of Education Research in the Humanities and Social Sciences 22YJA910002.