Inconsistency in the use of the term "validation" in studies reporting the performance of deep learning algorithms in providing diagnosis from medical imaging

Dong Wook Kim; Hye Young Jang; Yousun Ko; Jung Hee Son; Pyeong Hwa Kim; Seon-Ok Kim; Joon Seo Lim; Seong Ho Park

doi:10.1371/journal.pone.0238908

PLoS One. 2020 Sep 11;15(9):e0238908. doi: 10.1371/journal.pone.0238908. eCollection 2020.

Authors

Dong Wook Kim¹, Hye Young Jang², Yousun Ko³, Jung Hee Son¹, Pyeong Hwa Kim¹, Seon-Ok Kim⁴, Joon Seo Lim⁵, Seong Ho Park¹

Affiliations

¹ Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
² Department of Radiology, National Cancer Center, Goyang, Republic of Korea.
³ Biomedical Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea.
⁴ Department of Clinical Epidemiology and Biostatistics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
⁵ Scientific Publications Team, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.

Abstract

Background: The development of deep learning (DL) algorithms is a three-step process: training, tuning, and testing. Studies are inconsistent in their use of the term "validation", with some using it to refer to tuning and others to testing, which hinders accurate delivery of information and may inadvertently exaggerate the performance of DL algorithms. We investigated the extent of inconsistency in the usage of the term "validation" in studies on the accuracy of DL algorithms in providing diagnosis from medical imaging.
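The distinction between the three steps can be illustrated with a minimal sketch of a three-way data split. This is not taken from the paper: the helper name `three_way_split`, the 15% fractions, and the random shuffling are assumptions chosen for illustration, and in practice the test set may instead come from an independent external source.

```python
import random

def three_way_split(items, tuning_frac=0.15, test_frac=0.15, seed=0):
    """Hypothetical helper: partition items into training, tuning, and test subsets.

    - training: used to fit the model parameters
    - tuning:   used to select hyperparameters (often called "validation")
    - test:     held out to estimate final performance (also sometimes called "validation")
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_tune = round(n * tuning_frac)
    test = shuffled[:n_test]
    tuning = shuffled[n_test:n_test + n_tune]
    training = shuffled[n_test + n_tune:]
    return training, tuning, test

training, tuning, test = three_way_split(range(100))
print(len(training), len(tuning), len(test))
```

Because the tuning subset influences model selection, performance measured on it tends to be optimistic. Reporting it under the label "validation" without qualification is one way results could appear better than the performance on the held-out test subset.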

Methods and findings: We analyzed the full texts of research papers cited in two recent systematic reviews. The papers were categorized according to whether the term "validation" was used to refer to tuning alone, both tuning and testing, or testing alone. We analyzed whether paper characteristics (i.e., journal category, field of study, year of print publication, journal impact factor [JIF], and nature of test data) were associated with the usage of the terminology using multivariable logistic regression analysis with generalized estimating equations. Of 201 papers published in 125 journals, 118 (58.7%), 9 (4.5%), and 74 (36.8%) used the term to refer to tuning alone, both tuning and testing, and testing alone, respectively. A weak association was noted between higher JIF and using the term to refer to testing (i.e., testing alone or both tuning and testing) instead of tuning alone (vs. JIF <5; JIF 5 to 10: adjusted odds ratio 2.11, P = 0.042; JIF >10: adjusted odds ratio 2.41, P = 0.089). Journal category, field of study, year of print publication, and nature of test data were not significantly associated with the terminology usage.
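The reported proportions follow directly from the category counts, as the short arithmetic check below shows. The counts are those stated in the abstract. The variable names and the combined "testing-related" figure are added here for illustration only, and the regression analysis itself is not reproduced.

```python
# Category counts as reported in the abstract (201 papers in total).
counts = {
    "tuning alone": 118,
    "both tuning and testing": 9,
    "testing alone": 74,
}

total = sum(counts.values())

# Percentage of papers in each category, rounded to one decimal place.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Papers using "validation" to refer to testing in any form
# (testing alone, or both tuning and testing), the outcome compared
# against "tuning alone" in the abstract's association analysis.
testing_related = counts["both tuning and testing"] + counts["testing alone"]
testing_related_pct = round(100 * testing_related / total, 1)

for label, pct in percentages.items():
    print(f"{label}: {counts[label]}/{total} = {pct}%")
print(f"testing-related: {testing_related}/{total} = {testing_related_pct}%")
```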

Conclusions: The existing literature shows a significant degree of inconsistency in its use of the term "validation" when referring to the steps in DL algorithm development. Efforts are needed to improve the accuracy and clarity of terminology usage.

Publication types

Systematic Review

MeSH terms

Algorithms*
Diagnostic Imaging / methods*
Humans
Journal Impact Factor
Machine Learning*
Periodicals as Topic / standards*
Validation Studies as Topic

Grants and funding

The authors received no specific funding for this work.