PlanText: Gradually Masked Guidance to Align Image Phenotypes with Trait Descriptions for Plant Disease Texts

Kejun Zhao; Xingcai Wu; Yuanyuan Xiao; Sijun Jiang; Peijia Yu; Yazhou Wang; Qi Wang

doi:10.34133/plantphenomics.0272

Plant Phenomics. 2024 Nov 26:6:0272. doi: 10.34133/plantphenomics.0272. eCollection 2024.

Authors

Kejun Zhao¹, Xingcai Wu¹, Yuanyuan Xiao¹, Sijun Jiang¹, Peijia Yu¹, Yazhou Wang², Qi Wang¹

Affiliations

¹ State Key Laboratory of Public Big Data, School of Computer Science and Technology, Guizhou University, Guiyang 550025, China.
² School of Information, Guizhou University of Finance and Economics, Guiyang 550025, China.

Abstract

Plant diseases are a critical driver of the global food crisis. Integrating advanced artificial intelligence technologies can substantially enhance plant disease diagnostics. However, early detection and the diagnosis of complex cases remain challenging for current methods. Multimodal technologies, akin to medical artificial intelligence diagnostics that combine diverse data types, may offer a more effective solution. At present, plant disease research relies predominantly on single-modal data, which limits the scope for early and detailed diagnosis. Consequently, developing techniques for generating the text modality is essential for overcoming the limitations in plant disease recognition. To this end, we propose a method for aligning plant phenotypes with trait descriptions, which generates diagnostic text by progressively masking disease images. First, for training and validation, we annotate 5,728 disease phenotype images with expert diagnostic text and provide annotated text and trait labels for 210,000 disease images. Then, we propose a PhenoTrait text description model, which consists of global and heterogeneous feature encoders as well as switching-attention decoders, for accurate context-aware output. Next, to generate descriptions that better match the phenotype, we embed image features into semantic structures in 3 stages, producing characterizations that preserve trait features. Finally, our experimental results show that our model outperforms several frontier models, including the larger models GPT-4 and GPT-4o, in multiple trait descriptions. Our code and dataset are available at https://plantext.samlab.cn/.