Geological Journal of China Universities (高校地质学报) ›› 2023, Vol. 29 ›› Issue (3): 429-438. DOI: 10.16108/j.issn1006-7493.2023028

• Special Column on Text Mining and Knowledge Graphs in Solid Earth Science. Guest editors: MA Chao, ZHU Yunqiang, LÜ Hairong, HU Xiumian

Chinese Text-oriented Geological Semantic Information Annotation and Corpus Construction (面向中文文本的地质语义信息标注与语料库构建)

ZHANG Xueying1, ZHANG Chunju2,3*, WANG Chen3, LIU Wencong3, YE Peng4, LU Yanxu1

  1. Key Laboratory of Virtual Geographic Environment (Ministry of Education), Nanjing Normal University, Nanjing 210046, China;
  2. Key Laboratory of Jianghuai Arable Land Resources Protection and Eco-restoration, Ministry of Natural Resources, Hefei 230036, China;
  3. School of Civil Engineering, Hefei University of Technology, Hefei 230009, China;
  4. Urban Planning and Development Institute, Yangzhou University, Yangzhou 225127, China
  • Online: 2023-06-20 Published: 2023-06-20

Abstract: Structured extraction, semantic parsing, visualization, and knowledge graph construction for the geological information contained in text will provide a solid data foundation and technical support for the deep mining and utilization of geological big data. Whether a traditional statistical model or a deep learning model is used, the semantic parsing of geological information requires an annotated corpus. In particular, textual descriptions of geological information are domain-specific, so this need cannot be met by transferring general-purpose natural language corpora. The construction of geological information annotation corpora at different levels has therefore become the key foundation for parsing geological semantic information. Based on an analysis of the linguistic characteristics of geological semantic information in Chinese text, and starting from the spatiotemporal and attribute description features of geological entities, this paper clearly expresses the semantic relations among geological entities and formulates a labeling system and labeling specifications for geological semantic information in Chinese text. A self-developed "interactive geological semantic information labeling tool" addresses the high error rate and heavy repetitive workload of traditional manual labeling. Using Chinese research literature and reports on mineral resources as data sources, a large-scale geological semantic information annotation corpus is constructed, which goes a considerable way toward solving the current shortage of relevant standards and large-scale standard data.

Key words: Chinese text, geological entity, semantic relationship, labeling system, labeling specification
