151 / 2025-04-10 16:10:20
GeoGPT: Transforming Paleontology with AI-Powered Data Extraction and Analysis
GeoGPT, Geobiological Knowledge Extraction, Multimodal Artificial Intelligence
Abstract pending review
Ye Yufei / Zhejiang Lab
James Ogg / Purdue University; Chengdu University of Technology
Juye Wei / Zhejiang Lab
Zongyuan Xiang / Zhejiang Lab
Zhong Peng / Zhejiang Lab
Shuang Li / Zhejiang Lab
Shao Qi Yu / Zhejiang Lab
Jiang Yang / Zhejiang Lab
The exponential growth of geobiological literature presents unprecedented challenges in data extraction efficiency, particularly when dealing with century-spanning paleontological archives and complex stratigraphic records. Traditional methods struggle with three core issues:

  1. Weak semantic associations in long-form geological texts spanning multiple research paradigms;

  2. Structural complexity of cross-page scientific tables with nested hierarchies;

  3. Incompatible data formats across historical publications.


To address these challenges, we present GeoGPT, a non-profit, domain-specific multimodal AI system built for mining geobiological knowledge. GeoGPT combines the following components:

Multimodal Architecture for Scientific Document Analysis. Our hybrid system combines document-level semantic comprehension with fine-grained pattern detection in a single pipeline that parses the text narratives, tabular hierarchies and schematic diagrams found in geoscience literature. The multimodal architecture targets the digitization of legacy data locked in historical monographs and technical reports: it automatically extracts fragmented paleontological observations from multi-format documents and converts them into structured digital records. These structured outputs support large-scale evolutionary analyses by providing computationally tractable representations of the taxonomic relationships, stratigraphic distributions and morphological characteristics preserved in century-old scientific archives.
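
As an illustration of what a structured digital record of this kind might look like, the following is a minimal sketch only. The class name `FossilOccurrence`, its fields and the helper `to_json` are hypothetical and are not taken from GeoGPT; the sample values are illustrative and do not come from the benchmark.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class FossilOccurrence:
    """One structured record. All field names are illustrative assumptions."""
    taxon: str                    # taxon name as printed in the source
    higher_taxon: Optional[str]   # parent group, if the source states one
    formation: Optional[str]      # lithostratigraphic unit
    stage: Optional[str]          # chronostratigraphic stage
    morphology: dict = field(default_factory=dict)  # trait name -> value
    source_page: Optional[int] = None               # provenance: page number


def to_json(records):
    """Serialize a list of records to a JSON array for downstream analysis."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


# Illustrative record (hypothetical page number, not benchmark data).
example = FossilOccurrence(
    taxon="Hindeodus parvus",
    higher_taxon="Conodonta",
    formation="Yinkeng Formation",
    stage="Induan",
    source_page=12,
)
print(to_json([example]))
```

Keeping the page number on every record is one simple way to preserve a link back to the source document.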

Data Extraction Pipeline. Rather than running a single end-to-end extraction pass, our workflow is intent-aware. It parses each request into semantic requirements, decomposes the demand, disambiguates the extraction objectives and allocates the resulting subtasks across hybrid processing modules. The architecture combines GeoGPT's domain-specific knowledge retrieval with computer-vision-based diagram analysis, and uses prompt chaining to maintain contextual coherence across multi-page documents. Crucially, the pipeline includes multistage verification loops in which extracted entities are automatically reconciled with the source visual elements through graph-based backtracking algorithms. This design yields three advances: 1) significant mitigation of LLM hallucination through constraint-satisfaction processing; 2) full traceability of data provenance via task-specific lineage tracking; and 3) scalability from specimen-level feature extraction to ecosystem-scale pattern mining, capabilities that are out of reach for monolithic model approaches.
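
The idea of a verification loop can be sketched in a few lines. This is a deliberately simplified stand-in, not the graph-based backtracking described above: it only checks that each extracted value literally appears on the page it cites. The names `Extraction`, `normalize` and `verify` are hypothetical, as is the sample page text.

```python
import re
from dataclasses import dataclass


@dataclass
class Extraction:
    field_name: str  # attribute the value fills, e.g. "formation"
    value: str       # the extracted value
    page: int        # page the extractor claims the value came from


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify(extractions, pages):
    """Split extractions into (verified, rejected).

    `pages` maps page number -> page text. A value is verified only if it
    occurs in the text of the page it cites (simple substring check).
    """
    verified, rejected = [], []
    for ex in extractions:
        page_text = normalize(pages.get(ex.page, ""))
        value = normalize(ex.value)
        if value and value in page_text:
            verified.append(ex)
        else:
            rejected.append(ex)  # candidate for re-extraction or expert review
    return verified, rejected


# Hypothetical input: one value is supported by the page, one is not.
pages = {1: "Conodonts were collected from the Yinkeng\nFormation at Meishan."}
candidates = [
    Extraction("formation", "Yinkeng Formation", 1),
    Extraction("formation", "Feixianguan Formation", 1),
]
verified, rejected = verify(candidates, pages)
print([e.value for e in verified], [e.value for e in rejected])
```

Because every `Extraction` carries its page number, a rejected value can be traced back to the exact location that failed to support it.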

Benchmark Construction and Validation. Our interdisciplinary team has developed a tiered annotation framework that combines AI-assisted pre-annotation with expert-led verification. The workflow begins with domain specialists in paleontology, paleomagnetism and petroleum geology defining entity taxonomies and stratigraphic relationship schemas. Trained annotators then perform initial labeling on our custom platform, which uses active learning strategies to prioritize ambiguous cases for expert review. Current benchmarks encompass hundreds of peer-reviewed papers and technical reports, yielding 4,347 annotated instances across three disciplines: fossil occurrence records (31%), paleomagnetic polarity sequences (23%) and hydrocarbon reservoir characteristics (49%). Each data point undergoes dual validation: cross-referencing with the original visual elements and reconciliation with domain knowledge bases.
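
One common way to prioritize ambiguous cases for expert review is entropy-based uncertainty sampling. The sketch below assumes that technique; the abstract does not say which active learning strategy the platform uses. The function names and the probability values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def rank_for_review(predictions, top_k):
    """Return the ids of the `top_k` most ambiguous items.

    `predictions` maps item id -> list of class probabilities produced by
    the pre-annotation model. Higher entropy means a less certain label.
    """
    ranked = sorted(
        predictions,
        key=lambda item: entropy(predictions[item]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical pre-annotation scores over three candidate labels.
predictions = {
    "a": [0.98, 0.01, 0.01],  # confident prediction
    "b": [0.34, 0.33, 0.33],  # nearly uniform, most ambiguous
    "c": [0.60, 0.30, 0.10],  # moderately uncertain
}
print(rank_for_review(predictions, top_k=2))
```

Items that fall outside the top-ranked set would keep their pre-annotated labels until reviewer capacity allows a further pass.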

At the time of submission of this abstract, our ongoing development focuses on three strategic priorities:

  1. Specialized Model Training - Optimizing domain-specific extraction architectures to handle complex stratigraphic diagrams while maintaining computational efficiency.

  2. Cross-Domain Dataset Construction - Curating benchmark datasets spanning paleoclimate proxies, geochemical analyses and planetary surface features to enable systematic validation.

  3. AI-Reasoning Optimization - Developing domain-specific large language models with automated reasoning mechanisms that combine contextual logic parsing with dynamic knowledge graph integration, with the aim of improving accuracy in deciphering ambiguous stratigraphic correlations and cross-modal geological patterns.


Together, these parallel initiatives aim to establish new approaches to AI-assisted knowledge extraction in the Earth sciences. Initial applications show robust performance in processing materials science literature and astrogeological reports, suggesting that the framework adapts well to geoscience subdisciplines and adjacent fields.

 
Important dates
  • Conference dates: June 10-13, 2025

  • Draft submission deadline: April 15, 2025

Hosted by
National Natural Science Foundation of China
Geobiology Society
National Committee of Stratigraphy of China
Ministry of Science and Technology
Geological Society of China
Paleontological Society of China
Nanjing Institute of Geology and Palaeontology, Chinese Academy of Sciences (CAS)
Institute of Vertebrate Paleontology and Paleoanthropology, CAS
International Commission on Stratigraphy
International Paleontological Association
Organized by
State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences (CUG, Wuhan)