18 / 2025-03-21 10:56:24
ViT-Driven Visual Extraction in R2GenCMN for Medical Report Generation
vision transformer, radiology report generation, CMN
Full paper under review
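Jing Wang (王晶) / North China University of Technology; School of Electrical and Control Engineering
Chao Ling (凌超) / North China University of Technology; School of Electrical and Control Engineering
Junyan Fan (樊俊妍) / North China University of Technology; School of Electrical and Control Engineering
Meng Zhou (周萌) / North China University of Technology; School of Electrical and Control Engineering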
Medical imaging is a critical tool for diagnosing disease, and its textual interpretation is vital for analysis and treatment planning. Automating report generation eases radiologists' workload and supports clinical automation, and the task has attracted growing attention as AI in healthcare has developed. Prior studies often used pre-trained CNNs such as VGG and ResNet for feature extraction. Although effective, CNNs have local receptive fields that limit their ability to capture global context. To overcome this limitation, this study proposes ViT-R2GenCMN, which replaces the traditional CNN-based visual extractor with a Vision Transformer (ViT). ViT splits an image into a sequence of patches and uses self-attention to model global dependencies directly. We integrate ViT into the Cross-modal Memory Networks (CMN) framework to improve the alignment between visual and textual information and thereby the quality of the generated radiology reports. The paper examines the applicability of ViT to radiology report generation and evaluates its impact on cross-modal interaction. Experimental results demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the widely used IU X-Ray benchmark dataset.
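The two ViT steps named in the abstract, patch segmentation and self-attention, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the toy 4x4 image, and the choice Q = K = V (no learned projections, no positional embeddings, a single head) are simplifying assumptions made only for illustration.

```python
import math

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens.

    Assumption for illustration: a real ViT applies learned linear
    projections to obtain Q, K, and V; they are omitted here.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this patch to every patch, including distant ones.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output token is a weighted mix of all patch tokens.
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(d)])
    return outputs, weights

# Toy 4x4 "image" split into four 2x2 patches (hypothetical data).
image = [[0.0, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0, 1.1],
         [1.2, 1.3, 1.4, 1.5]]
tokens = patchify(image, 2)
mixed, weights = self_attention(tokens)
```

Every row of `weights` is strictly positive and sums to 1, so each output token draws on all patches at once. This is the global-context property that the abstract contrasts with the local receptive fields of CNNs.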
Important dates
  • Conference dates: August 22 to August 24, 2025
  • Initial submission deadline: April 25, 2025

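Organizer
Technical Committee on Fault Diagnosis and Safety for Technical Processes, Chinese Association of Automation (中国自动化学会技术过程的故障诊断与安全性专业委员会)
Hosts
Xinjiang University (新疆大学)
Xinjiang Association of Automation (新疆自动化学会)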