Research on Geological Entity Recognition Based on BERTwwm and Data Augmentation

Affiliation:

1. School of Geosciences, Yangtze University; 2. Shengli Oilfield Company, SINOPEC

Fund Project:

National Natural Science Foundation of China (42172172, 42130813); Open Fund of the Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education (PI2023-04)
Abstract:

Geological named entity recognition (NER) identifies geological entities in geological texts and classifies them into the correct geological concepts. It is an intelligent knowledge-extraction task and one of the key technologies for building knowledge graphs in the geological domain. This study addresses two major challenges in geological NER: insufficient accuracy on complex entities and the high cost of sample annotation. We construct a geological entity recognition model, BERTwwm-BiLSTM-Attention-CRF, which uses an improved pre-training layer, BERTwwm, and adds a Self-Attention module. The model significantly improves recognition accuracy for complex geological entities, achieving a precision of 92.67%, a recall of 94.21%, and an F1-score of 93.29%. To reduce annotation cost and improve accuracy on small datasets, we also optimize the model construction workflow: a model-assisted annotation method speeds up dataset labeling, and an improved Easy Data Augmentation (EDA) method, combined with a geological dictionary, effectively expands the dataset and reduces the difficulty of manual annotation. Comparative and ablation experiments show that the proposed improvements enhance geological entity recognition. The approach offers an efficient and economical solution for geological text analysis and supports both knowledge graph construction and the intelligent processing of geological information in the geological domain.
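The abstract does not spell out how the geological dictionary is used to expand the dataset. As a rough illustration only, the sketch below shows one common dictionary-based strategy, entity replacement on BIO-tagged sequences, in which each labeled entity is swapped for another dictionary term of the same type and the tags are regenerated. The function names, the label types (`STRAT`, `ROCK`), and the dictionary contents are all hypothetical, and this should not be read as the authors' actual augmentation procedure.

```python
import random


def extract_spans(tags):
    """Return (start, end, type) entity spans from BIO tags; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        else:
            # "O" tag or a stray/mismatched "I-" tag closes any open entity
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def replace_entities(tokens, tags, dictionary, rng):
    """Swap each entity for a random dictionary term of the same type.

    `dictionary` maps an entity type to a list of terms, each term being a
    list of tokens (hypothetical format). Entities whose type has no
    dictionary entry are copied unchanged.
    """
    new_tokens, new_tags = [], []
    cursor = 0
    for start, end, etype in extract_spans(tags):
        # Copy the non-entity context before this span
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        candidates = dictionary.get(etype)
        if candidates:
            term = rng.choice(candidates)
            new_tokens.extend(term)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(term) - 1))
        else:
            new_tokens.extend(tokens[start:end])
            new_tags.extend(tags[start:end])
        cursor = end
    # Copy whatever follows the last entity
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


# Hypothetical example sentence and dictionary
tokens = ["The", "Shahejie", "Formation", "contains", "sandstone"]
tags = ["O", "B-STRAT", "I-STRAT", "O", "B-ROCK"]
geo_dict = {
    "STRAT": [["Dongying", "Formation"], ["Guantao", "Formation"]],
    "ROCK": [["mudstone"], ["oil", "shale"]],
}
augmented_tokens, augmented_tags = replace_entities(
    tokens, tags, geo_dict, random.Random(0)
)
print(list(zip(augmented_tokens, augmented_tags)))
```

Because the tags are rebuilt from the replacement term's length, augmented samples stay consistently labeled without further manual annotation, which is the kind of cost reduction the abstract describes.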

History
  • Received: 2023-12-26
  • Last revised: 2024-02-06
  • Accepted: 2024-06-21