InstructIE: A Bilingual Instruction-based Information Extraction Dataset

Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Huajun Chen, Ningyu Zhang*

Zhejiang University, Zhejiang Lab, Ant Group
*Corresponding Author

Overview of InstructIE dataset construction via KG2Instruction. (a) Identifying all entity mentions. (b) Disambiguating entity mentions to obtain their unique Wikidata IDs. (c) Matching relations for each entity pair. (d) Removing irrelevant triplets by Schema Constraint. (e) Rule Refinement and Filtering.
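The five stages in the caption above can be illustrated with a toy sketch. The data structures, the naive dictionary-matching strategy, and all names below are our illustrative assumptions, not the actual pipeline:

```python
# Toy sketch of the KG2Instruction stages (a)-(e); data structures and
# matching strategy are illustrative assumptions, not the real pipeline.

def kg2instruction(text, kg, schema):
    # (a) Identify entity mentions (naive dictionary matching here).
    mentions = [m for m in kg["aliases"] if m in text]
    # (b) Disambiguate each mention to its unique Wikidata ID.
    entities = {m: kg["aliases"][m] for m in mentions}
    # (c) Match relations for each ordered entity pair via the KG.
    triplets = []
    for head in mentions:
        for tail in mentions:
            if head == tail:
                continue
            rel = kg["relations"].get((entities[head], entities[tail]))
            if rel is not None:
                triplets.append((head, rel, tail))
    # (d) Remove triplets whose relation falls outside the theme schema.
    triplets = [t for t in triplets if t[1] in schema]
    # (e) Rule refinement and filtering would be applied here (omitted).
    return triplets

kg = {
    "aliases": {"Marie Curie": "Q7186", "Warsaw": "Q270"},
    "relations": {("Q7186", "Q270"): "place of birth"},
}
print(kg2instruction("Marie Curie was born in Warsaw.", kg, {"place of birth"}))
# [('Marie Curie', 'place of birth', 'Warsaw')]
```

In the real pipeline, stage (a) would rely on linking tools rather than exact string matching, and stage (e) applies hand-crafted rules to raise precision.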

Abstract

Traditional information extraction (IE) methodologies, constrained by pre-defined classes and static training paradigms, often falter in adaptability, especially in dynamic real-world settings. To bridge this gap, we adopt an Instruction-based IE paradigm in this paper. Yet we observe that most existing IE datasets have overly redundant label sets, so instructions often involve numerous labels not directly relevant to the extraction content. To tackle this issue, we introduce InstructIE, a bilingual theme-centric IE instruction dataset, and, for the first time, incorporate the design of a theme scheme, effectively simplifying the label structure. Furthermore, we develop an innovative framework named KG2Instruction, specifically designed for the automatic generation of such datasets. Experimental evaluations on InstructIE reveal that while current models already show promise in Instruction-based IE tasks, opportunities for further optimization also emerge.



Framework Design

Comparison of traditional approaches with Instruction-based IE in handling emergent classes (unseen during training).


  • Although traditional approaches possess unique capabilities, a notable limitation is their inherent constraint to pre-defined classes, coupled with a once-and-for-all training pattern. Such inflexibility hampers adaptability, especially in the ever-evolving real world that demands more scalable solutions.

  • In an ideal scenario, the Information Extraction (IE) system should be capable of interpreting natural language instructions and producing the expected responses accordingly. As user requirements evolve, real-time feedback can be achieved by adjusting the instructions, which in turn guides the model to adapt its behavior. This innovative paradigm is denoted as Instruction-based IE.
Illustration of Instruction-based IE instructions: an Instruction warehouse containing multiple instruction forms for different tasks, a Schema warehouse containing the most relevant labels under each theme, and a Format warehouse containing multiple output formats for different tasks.
Classification of entity types, categorized into 14 kinds. The inner ring represents abstract entity type divisions, while the outer ring delineates more specific entity type subdivisions.
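A minimal sketch of how an instruction might be assembled from the three warehouses. The templates, theme labels, and output formats below are illustrative placeholders, not the dataset's actual contents:

```python
import random

# Illustrative warehouses; the real InstructIE templates and labels differ.
INSTRUCTION_WAREHOUSE = [
    "Extract all triplets matching the schema {schema} from the text: {text}",
    "List every (head, relation, tail) triplet for {schema} found in: {text}",
]
SCHEMA_WAREHOUSE = {
    "Person": ["place of birth", "occupation", "spouse"],
    "Event": ["location", "start time", "participant"],
}
FORMAT_WAREHOUSE = ["JSON", "triplet list"]

def build_instruction(theme, text, seed=0):
    """Sample one instruction form, the theme's schema, and an output format."""
    rng = random.Random(seed)
    template = rng.choice(INSTRUCTION_WAREHOUSE)
    fmt = rng.choice(FORMAT_WAREHOUSE)
    prompt = template.format(schema=SCHEMA_WAREHOUSE[theme], text=text)
    return f"{prompt} (output format: {fmt})"

print(build_instruction("Person", "Marie Curie was born in Warsaw."))
```

Because each theme carries only its most relevant labels, the sampled instruction stays short instead of enumerating the full label set.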


Main Results

Our experimental design seeks to systematically investigate the efficacy and applicability of diverse methodologies within the realm of Instruction-based IE. Central to this inquiry are three strategies: (1) zero-shot learning, (2) in-context learning, and (3) fine-tuning (including QLoRA).

We utilize a blend of span-based micro-F1 and the ROUGE-2 score, expressed as Score = 0.5 × F1 + 0.5 × ROUGE-2.
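A minimal sketch of this metric, assuming exact-match triplet comparison for the span-based micro-F1 and a simplified bigram F-measure for ROUGE-2; the official evaluation may differ in tokenization and implementation details:

```python
def span_micro_f1(pred_triplets, gold_triplets):
    """Micro-F1 over exact-match (head, relation, tail) triplets."""
    tp = len(set(pred_triplets) & set(gold_triplets))
    p = tp / len(pred_triplets) if pred_triplets else 0.0
    r = tp / len(gold_triplets) if gold_triplets else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

def rouge2(pred_text, gold_text):
    """Simplified ROUGE-2 F-measure over whitespace-token bigrams."""
    def bigrams(s):
        toks = s.split()
        return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]
    pb, gb = bigrams(pred_text), bigrams(gold_text)
    overlap = sum(min(pb.count(b), gb.count(b)) for b in set(pb) & set(gb))
    if not overlap:
        return 0.0
    p, r = overlap / len(pb), overlap / len(gb)
    return 2 * p * r / (p + r)

def score(pred_triplets, gold_triplets, pred_text, gold_text):
    # Score = 0.5 * F1 + 0.5 * ROUGE-2
    return 0.5 * span_micro_f1(pred_triplets, gold_triplets) \
         + 0.5 * rouge2(pred_text, gold_text)

gold = [("Marie Curie", "place of birth", "Warsaw")]
text = "Marie Curie place of birth Warsaw"
print(score(gold, gold, text, text))  # 1.0
```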

  • Zero-shot Learning Performance. In Instruction-based IE tasks, even for LLMs like ChatGPT, zero-shot learning still faces challenges. Evidently, the open-source 13B model lags considerably behind ChatGPT in zero-shot learning, emphasizing its deficits in instruction comprehension and knowledge representation.

  • Few-shot Learning Performance. ChatGPT registers improvements of ↑4.46 and ↑10.89 on the Chinese and English datasets, respectively. In parallel, other models also post substantial improvements, effectively narrowing the gap with ChatGPT. Such results indicate the capacity of these models to internalize instructions from contextual examples and generate outputs accordingly.

  • Fine-tuning Performance. Distinctly, Baichuan2-13B-Base stands out in Chinese tasks, and Vicuna-v1.5-13B-16k excels in English tasks, achieving high scores of 59.93 and 49.96 respectively. When juxtaposing MT5-Base with its peers, it becomes apparent that fine-tuning a specific subset of parameters within larger models yields superior performance compared to comprehensive tuning in smaller ones. This observation might be attributed to QLoRA, which seemingly guides models to tune the format sub-distribution of their outputs, rather than innate knowledge.

  • Model Size and Its Impact. When contrasting the 7B and 13B iterations of both Baichuan2 and Llama2, it becomes evident that the model’s size is intrinsically linked to its performance in instruction-based IE tasks. Additionally, a comparative analysis involving KnowLM, Baichuan2, and Vicuna underscores an essential insight: even among models of the same size, the foundational performance of the base model exerts a profound influence on the output.



Analysis

Upon a thorough analysis of the model’s predictions, we identify that the majority of errors fall into the following four groups: (a) Entity Mismatch: within a given triplet, either only the head entity or only the tail entity fails to align with the gold standard, while all other components remain accurate. (b) Spurious Relation: the model produces relations not reflected in the gold standard, indicating potential over-generation or hallucination. (c) Boundary Mismatch: the predicted head or tail entity partially overlaps with the gold standard but fails to capture it entirely. (d) Incongruent Predictions: several components of the prediction fail to align, rendering the output akin to arbitrary generation.
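One way to operationalize this taxonomy is a rule-based categorizer over predicted triplets. The heuristics below, such as the substring test for boundary mismatches, are our illustrative assumptions rather than the paper's exact procedure:

```python
def categorize_error(pred, golds):
    """Classify one predicted (head, relation, tail) triplet against gold
    triplets; the heuristics here are illustrative assumptions only."""
    head, rel, tail = pred
    if pred in golds:
        return "correct"
    for g_head, g_rel, g_tail in golds:
        if rel == g_rel and (head == g_head) != (tail == g_tail):
            # Exactly one entity is wrong: decide boundary vs. full mismatch.
            bad, gold_bad = (tail, g_tail) if head == g_head else (head, g_head)
            if bad in gold_bad or gold_bad in bad:
                return "Boundary Mismatch"   # (c) partial string overlap
            return "Entity Mismatch"         # (a) one entity entirely off
    if rel not in {g_rel for _, g_rel, _ in golds}:
        return "Spurious Relation"           # (b) relation absent from gold
    return "Incongruent Predictions"         # (d) multiple components off

golds = [("Marie Curie", "place of birth", "Warsaw")]
print(categorize_error(("Curie", "place of birth", "Warsaw"), golds))  # Boundary Mismatch
print(categorize_error(("Pierre", "spouse", "Marie Curie"), golds))    # Spurious Relation
```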

The introduction of contextual examples significantly diminishes the error rate in the "Spurious Relation" category. Nevertheless, in tandem, there is a discernible rise in error rates for "Entity Mismatch", "Boundary Mismatch", and "Incongruent Predictions". We posit that these contextual examples enhance the model’s capacity to understand instructions, subsequently enabling it to focus more on the extraction and articulation of relations in alignment with the provided instructions.

  • Generalization to Unseen Theme. We exclude all data related to the three major themes of "Natural Science", "Medicine", and "Event" from the InstructIE-zh dataset and train the Baichuan2-7B-Base model on this reduced set. The model’s average performance on these three unseen themes surpasses ChatGPT’s 5-shot in-context learning by ↑1.17. Such results suggest that, through specialized instruction tuning, the model not only adapts to but even excels in themes it has never encountered.
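The held-out split described above can be sketched as follows; the `theme` field name and example records are assumptions:

```python
# Sketch of the unseen-theme split; the "theme" field name is assumed.
HELD_OUT_THEMES = {"Natural Science", "Medicine", "Event"}

def split_by_theme(dataset):
    """Partition examples into training themes and held-out (unseen) themes."""
    train = [ex for ex in dataset if ex["theme"] not in HELD_OUT_THEMES]
    unseen = [ex for ex in dataset if ex["theme"] in HELD_OUT_THEMES]
    return train, unseen

data = [
    {"theme": "Person", "text": "..."},
    {"theme": "Medicine", "text": "..."},
    {"theme": "Event", "text": "..."},
]
train, unseen = split_by_theme(data)
print(len(train), len(unseen))  # 1 2
```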

  • Influence of Instructional Design. We standardize the instruction format across all data and train Baichuan2-7B-Base on it, producing a model variant named SINGLE. While its precision rises, its recall and F1 score drop by ↓3.7 and ↓0.89, respectively. These metrics suggest that, in the presence of diverse instructions, the model exhibits enhanced robustness in instruction interpretation and tends to assimilate information more holistically.

  • No output vs. outputting "NAN". For relations that appear in the instruction but not in the actual sentence, the model might adopt two strategies: produce no output, or output "NAN" to explicitly state that the relation does not exist. Based on this, we train another model variant, called W/ NAN, whose instructions explicitly mandate a "NAN" response. Experimental results reveal that while recall remains stable, precision substantially increases by ↑10.58, boosting F1 by ↑4.07. This confirms that encouraging the model to explicitly output "NAN" to indicate "non-existence" effectively reduces the risk of producing spurious relations.
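The W/ NAN convention implies a post-processing step that discards the explicit "NAN" placeholders before scoring; a minimal sketch, with the triplet representation assumed:

```python
# Sketch of W/ NAN post-processing: triplets containing the explicit
# "NAN" placeholder are dropped before scoring (representation assumed).

def drop_nan_triplets(raw_triplets):
    """Keep only triplets whose components contain no 'NAN' placeholder."""
    return [t for t in raw_triplets if "NAN" not in t]

raw = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("NAN", "spouse", "NAN"),
]
print(drop_nan_triplets(raw))
# [('Marie Curie', 'place of birth', 'Warsaw')]
```

Making non-existence explicit gives the model a sanctioned way to "answer" irrelevant relations, which is one plausible reading of why precision rises while recall holds steady.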


BibTeX


      @article{DBLP:journals/corr/abs-2305-11527,
        author       = {Honghao Gui and
                        Jintian Zhang and
                        Hongbin Ye and
                        Ningyu Zhang},
        title        = {InstructIE: {A} Chinese Instruction-based Information Extraction Dataset},
        journal      = {CoRR},
        volume       = {abs/2305.11527},
        year         = {2023},
        url          = {https://doi.org/10.48550/arXiv.2305.11527},
        doi          = {10.48550/arXiv.2305.11527},
        eprinttype   = {arXiv},
        eprint       = {2305.11527},
        timestamp    = {Thu, 25 May 2023 15:41:47 +0200},
        biburl       = {https://dblp.org/rec/journals/corr/abs-2305-11527.bib},
        bibsource    = {dblp computer science bibliography, https://dblp.org}
      }

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.