Similarity-Based Machine Learning for Small Data Sets: Predicting Biolubricant Base Oil Viscosities

Jae Young Kim; Salman A Khan; Dionisios G Vlachos

doi:10.1021/acs.jpcb.4c06687

J Phys Chem B. 2024 Nov 23. doi: 10.1021/acs.jpcb.4c06687. Online ahead of print.

Authors

Jae Young Kim¹,², Salman A Khan², Dionisios G Vlachos¹,²

Affiliations

¹ Department of Chemical and Biomolecular Engineering, University of Delaware, Newark, Delaware 19716, United States.
² Delaware Energy Institute (DEI), University of Delaware, 221 Academy St., Newark, Delaware 19716, United States.

PMID: 39579140
DOI: 10.1021/acs.jpcb.4c06687

Abstract

Machine learning (ML) has been successfully applied to learn patterns in experimental chemical data to predict molecular properties. However, experimental data can be time-consuming and expensive to obtain and, as a result, it is often scarce. Several ML methods struggle when trained on limited data. Here, we introduce a similarity-based machine learning approach that enables precise model training on small data sets while requiring fewer features and enhancing prediction accuracy. We group molecules with similar structures, represented by molecular fingerprints, and train a separate ML model for each group. We first validate our method on larger data sets of dynamic viscosity and aqueous solubility, demonstrating comparable or better performance than traditional approaches while requiring fewer features. We then apply the validated methodology to predict the kinematic viscosity of biolubricant base oil molecules at 40 °C (KV40), where experimental data is particularly limited. Our method shows a noticeable improvement in model performance for KV40 prediction compared to transfer learning and the standard Random Forest. This approach provides a robust framework for limited data that can be readily generalized to a diverse range of molecular data sets, especially when clear structural patterns exist in the data set.
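
The grouping idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: fingerprints are represented as plain sets of on-bit indices, structural similarity is measured with the Tanimoto coefficient, groups are formed by a simple greedy "leader" rule with a hypothetical threshold of 0.5, and a one-feature least-squares line stands in for the per-group ML model (the abstract mentions Random Forest as a baseline but does not specify the per-group model here). The toy data are invented and carry no chemical meaning.

```python
from statistics import mean


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def group_by_similarity(fingerprints, threshold=0.5):
    """Greedy leader grouping (an assumed rule, not taken from the paper).

    Each molecule joins the first group whose leader fingerprint is at least
    `threshold` similar; otherwise it becomes the leader of a new group.
    """
    leaders, groups = [], []
    for i, fp in enumerate(fingerprints):
        for g, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                groups[g].append(i)
                break
        else:
            leaders.append(fp)
            groups.append([i])
    return leaders, groups


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; falls back to the mean if x is constant."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, float(my)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def train_group_models(fingerprints, xs, ys, threshold=0.5):
    """Group molecules by similarity, then fit one small model per group."""
    leaders, groups = group_by_similarity(fingerprints, threshold)
    models = [fit_line([xs[i] for i in g], [ys[i] for i in g]) for g in groups]
    return leaders, groups, models


def predict(fp, x, leaders, models):
    """Route a new molecule to the group with the most similar leader."""
    best = max(range(len(leaders)), key=lambda g: tanimoto(fp, leaders[g]))
    a, b = models[best]
    return a * x + b


# Invented toy data: two structural families with different property trends.
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4},
           {10, 11, 12}, {10, 11, 13}, {10, 11, 12, 13}]
toy_x = [1, 2, 3, 1, 2, 3]          # a single made-up descriptor
toy_y = [2, 4, 6, 11, 12, 13]       # a made-up target property

leaders, groups, models = train_group_models(toy_fps, toy_x, toy_y)
pred_a = predict({1, 2, 3, 5}, 4, leaders, models)
pred_b = predict({10, 11, 12, 14}, 4, leaders, models)
```

In this toy setting a single global line would have to compromise between the two families, whereas each per-group model only has to capture one trend from three points. In practice the fingerprints would come from a cheminformatics toolkit and the per-group model would be chosen to suit the data.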