Study on Data Preprocessing for Machine Learning Based on Semiconductor Manufacturing Processes

Ha-Je Park; Yun-Su Koo; Hee-Yeong Yang; Young-Shin Han; Choon-Sung Nam

doi:10.3390/s24175461

Sensors (Basel). 2024 Aug 23;24(17):5461. doi: 10.3390/s24175461.

Authors

Ha-Je Park¹, Yun-Su Koo², Hee-Yeong Yang¹, Young-Shin Han³, Choon-Sung Nam¹

Affiliations

¹ Department of Software Convergence Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Republic of Korea.
² Department of Mechatronics Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Republic of Korea.
³ Frontier College, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Republic of Korea.

Abstract

The many types of data generated in the semiconductor manufacturing process can be used to increase product yield and reduce manufacturing costs. However, these data are collected from a variety of sensors, so they come in diverse units and form an imbalanced dataset biased towards the majority class. This study evaluated analysis and preprocessing methods for predicting good and defective products with machine learning, with the aim of increasing yield and reducing costs in semiconductor manufacturing processes. The SECOM dataset was used for this purpose, and the preprocessing steps comprised missing value handling, dimensionality reduction, resampling to address class imbalance, and scaling. Six machine learning models were then evaluated and compared using the geometric mean (GM) and other metrics to assess combinations of preprocessing methods on imbalanced data. Unlike previous studies, this research proposes methods that reduce the number of features used in machine learning in order to shorten training and prediction times. Furthermore, this study prevents data leakage during preprocessing by separating the training and test datasets before analysis and preprocessing. The results showed that applying oversampling methods, with the exception of KM SMOTE, achieves more balanced classification across the classes. The combination of SVM, ADASYN, and MaxAbs scaling performed best of all the combinations, with an accuracy of 85.14% and a GM of 72.95%.
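
The abstract does not spell out how the GM is computed or how scaling was kept leakage-free, so the following is only a minimal sketch under stated assumptions: the GM is taken to be the usual two-class form, the square root of sensitivity times specificity, and MaxAbs scaling is taken to mean dividing each feature by its largest absolute value. The function names, the toy labels, and the choice of label 1 for the defective (minority) class are hypothetical and are not taken from the study.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def geometric_mean(y_true, y_pred, positive=1):
    """Assumed two-class GM: sqrt(sensitivity * specificity)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


def fit_max_abs(train_rows):
    """Learn per-feature scale factors from the TRAINING rows only."""
    n_features = len(train_rows[0])
    scales = [max(abs(row[j]) for row in train_rows) for j in range(n_features)]
    # A feature that is all zeros is left unscaled to avoid division by zero.
    return [s if s != 0 else 1.0 for s in scales]


def apply_max_abs(rows, scales):
    """Scale any rows (training or test) with previously fitted factors."""
    return [[v / s for v, s in zip(row, scales)] for row in rows]


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 90 good (0), 10 defective (1).
    y_true = [0] * 90 + [1] * 10
    always_good = [0] * 100  # a classifier that never flags a defect
    print(accuracy(y_true, always_good))        # high despite missing every defect
    print(geometric_mean(y_true, always_good))  # collapses to zero

    # Split first, then fit the scaler on the training part only.
    train = [[2.0, -4.0], [-1.0, 0.5]]
    test = [[4.0, 2.0]]
    scales = fit_max_abs(train)
    print(apply_max_abs(test, scales))
```

The toy labels illustrate why accuracy alone can mislead on imbalanced data: a classifier that always predicts the majority class scores high accuracy while its GM drops to zero. Fitting the scale factors on the training rows and reusing them on the test rows mirrors the split-before-preprocessing idea described in the abstract, so test values may legitimately fall outside the range [-1, 1].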

Keywords: SECOM dataset; geometric mean; machine learning; oversampling; semiconductor manufacturing process.

Grants and funding

IITP-2024-RS-2023-00259678/Institute for Information and Communications Technology Promotion