LLMs / AutoKG: Translation and Commentary on "LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities"
Contents
Generalizability Analysis
2 Recent Capabilities of LLMs for KG Construction and Reasoning
2.2 KG Construction and Reasoning
2.3 Discussion: Why do LLMs not present satisfactory performance on some tasks?
3 Future Opportunities: Automatic KG Construction and Reasoning
4 Conclusion and Future Work
Ethical and Factual Considerations
Address | |
Date | May 22, 2023 (last updated: February 22, 2023) |
Authors | Zhejiang University team et al. |
Summary | Background and pain point: constructing and reasoning over knowledge graphs is a complex task that demands substantial human effort. Solution: the paper proposes AutoKG, which uses multiple intelligent agents to collaboratively accomplish KG construction and reasoning. Each agent plays a distinct role, such as KG assistant or KG user, and the agents cooperate to complete the tasks; external knowledge bases and internet resources are incorporated to compensate for the knowledge limitations of the language model itself. Key features: >> Multi-agent collaboration, with different roles assigned to KG tasks, improves efficiency. >> External knowledge bases and internet resources compensate for the model's knowledge limitations, yielding more complete and accurate KGs. >> Human-machine collaboration supervises the language model and corrects its errors, producing higher-quality KGs. Advantages: >> Simplifies the KG construction process and improves efficiency. >> Compensates for the model's knowledge limitations, producing more complete, high-quality KGs. >> Human-in-the-loop supervision improves the transparency and factual correctness of language models applied to the KG domain. >> Role-based collaboration yields deeper insight into how language models operate during decision-making. In short, by combining multi-agent collaboration with external resources, AutoKG addresses the knowledge limitations a language model faces when tackling KG tasks alone, effectively simplifying construction and raising KG quality. |
This paper presents an exhaustive quantitative and qualitative evaluation of Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning. We engage in experiments across eight diverse datasets, focusing on four representative tasks encompassing entity and relation extraction, event extraction, link prediction, and question-answering, thereby thoroughly exploring LLMs' performance in the domain of construction and inference. Empirically, our findings suggest that LLMs, represented by GPT-4, are more suited as inference assistants rather than few-shot information extractors. Specifically, while GPT-4 exhibits good performance in tasks related to KG construction, it excels further in reasoning tasks, surpassing fine-tuned models in certain cases. Moreover, our investigation extends to the potential generalization ability of LLMs for information extraction, leading to the proposition of a Virtual Knowledge Extraction task and the development of the corresponding VINE dataset. Based on these empirical findings, we further propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning. We anticipate that this research can provide invaluable insights for future undertakings in the field of knowledge graphs.
Knowledge Graph (KG) is a semantic network comprising entities, concepts, and relations (Cai et al., 2022; Zhu et al., 2022; Liang et al., 2022; Chen et al., 2023; Pan et al., 2023b,a), which can catalyse applications across various scenarios. Constructing KGs (Ye et al., 2022b) typically involves multiple tasks such as Named Entity Recognition (NER) (Chiu and Nichols, 2016), Relation Extraction (RE) (Zeng et al., 2015; Chen et al., 2022), Event Extraction (EE) (Chen et al., 2015; Deng et al., 2020), and Entity Linking (EL) (Shen et al., 2015). Additionally, Link Prediction (LP) (Zhang et al., 2018; Rossi et al., 2021) is a crucial step for KG reasoning, essential for understanding constructed KGs. These KGs also hold a central position in Question Answering (QA) tasks (Karpukhin et al., 2020; Zhu et al., 2021), especially in conducting inference based on question context, involving the construction and application of relation subgraphs. This paper focuses on empirically investigating the potential applicability of LLMs (Liu et al., 2023; Shakarian et al., 2023; Lai et al., 2023; Zhao et al., 2023b), exemplified by ChatGPT and GPT-4 (OpenAI, 2023). By comprehending the fundamental capabilities of LLMs, our study further delves into potential future directions.
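To make the data structure behind these tasks concrete, here is a minimal sketch (not from the paper) of a KG as a set of (head, relation, tail) triples, the output format that tasks like NER, RE, and EE ultimately feed into, and that LP and KGQA reason over:

```python
class KnowledgeGraph:
    """Toy triple store for illustration only."""

    def __init__(self):
        self.triples = set()

    def add(self, head, relation, tail):
        # Add one fact, e.g. ("Hangzhou", "part_of", "China").
        self.triples.add((head, relation, tail))

    def neighbors(self, entity):
        # Return every triple touching an entity -- the raw material
        # for reasoning tasks such as Link Prediction and KGQA.
        return {(h, r, t) for (h, r, t) in self.triples
                if h == entity or t == entity}


kg = KnowledgeGraph()
kg.add("Zhejiang University", "located_in", "Hangzhou")
kg.add("Hangzhou", "part_of", "China")
local_subgraph = kg.neighbors("Hangzhou")
```

The relation subgraph returned by `neighbors` is the kind of question-context structure the QA discussion above refers to.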
Recent Capabilities. Entity and Relation Extraction and Event Extraction serve as foundational elements for constructing knowledge graphs, facilitating the refinement of a wealth of entity, relation, and event information. Meanwhile, Link Prediction, as a core task of KG reasoning, aims to uncover latent relationships between entities, thereby enriching the knowledge graph. Additionally, we further explore the application of LLMs in knowledge-based Question Answering tasks to gain a comprehensive understanding of their inferential skills. Given these considerations, we select these tasks as representatives for evaluating both the construction and reasoning of KGs. As illustrated in Figure 1, our initial investigation targets the zero-shot and one-shot abilities of large language models across the aforementioned tasks. This analysis serves to assess the potential usage of such models in the field of knowledge graphs. The empirical findings reveal that LLMs like GPT-4 exhibit limited effectiveness as a few-shot information extractor, yet demonstrate considerable proficiency as an inference assistant.
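The zero-shot/one-shot distinction above can be sketched as prompt assembly. The template wording and field names below are our own illustration; the paper does not publish this exact prompt:

```python
def build_re_prompt(demo, sentence, shots=1):
    """Assemble a zero-shot (shots=0) or one-shot (shots=1) prompt
    for relation extraction, returned as one string for the LLM."""
    instruction = ("Extract (head, relation, tail) triples "
                   "from the sentence below.")
    parts = [instruction]
    if shots >= 1:
        # The single in-context demonstration that makes it "one-shot".
        parts.append(f"Demonstration:\nSentence: {demo['text']}\n"
                     f"Triples: {demo['triples']}")
    parts.append(f"Sentence: {sentence}\nTriples:")
    return "\n\n".join(parts)


demo = {"text": "Paris is the capital of France.",
        "triples": "(Paris, capital_of, France)"}
zero_shot = build_re_prompt(demo, "Tokyo is the capital of Japan.", shots=0)
one_shot = build_re_prompt(demo, "Tokyo is the capital of Japan.", shots=1)
```

The only difference between the two settings is whether the demonstration block is present; the instruction and target sentence are identical.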
Generalizability Analysis. To delve deeper into the behavior of LLMs in information extraction tasks, we devise a unique task termed Virtual Knowledge Extraction. This undertaking aims to discern whether the observed performance enhancements on these tasks are attributed to the extensive internal knowledge repositories of LLMs or to their potent generalization capabilities facilitated by instruction tuning and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). And our experiments on a newly constructed dataset, VINE, indicate that large language models like GPT-4 can acquire new knowledge from instructions and effectively execute extraction tasks, thereby affording a more nuanced understanding of large models to a certain extent.
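The core trick of Virtual Knowledge Extraction can be illustrated as follows: pair invented entities and relations, which cannot appear in the pre-training corpus, with a sentence that states the fact, so any correct extraction must come from in-context generalization rather than memorized knowledge. The entity and relation names below are invented for illustration and are not drawn from VINE:

```python
import random


def make_virtual_example(rng):
    """Build one (sentence, gold triple) pair over made-up vocabulary."""
    fake_entities = ["cykrypto", "glimmerol", "vantorine", "drellium"]
    fake_relation = "holds_reverse_patronage_of"
    head, tail = rng.sample(fake_entities, 2)  # two distinct fake entities
    sentence = (f"It is known that {head} "
                f"{fake_relation.replace('_', ' ')} {tail}.")
    gold = (head, fake_relation, tail)
    return sentence, gold


rng = random.Random(0)
sentence, gold = make_virtual_example(rng)
```

An LLM is then prompted with such sentences (plus a demonstration) and scored on whether it recovers the gold triple, isolating contextual learning from stored world knowledge.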
Future Opportunities. In light of the preceding experiments, we further examine prospective directions for knowledge graphs. Given the remarkable generalization capabilities of large models, we opt to employ them to aid in the construction of KG. Compared to smaller models, these LLMs mitigate potential resource wastage and demonstrate notable adaptability in novel or data-scarce situations. However, it's important to recognize their strong dependence on prompt engineering and the inherent limitations of their knowledge cutoff. Consequently, researchers are exploring interactive mechanisms that allow LLMs to access and leverage external resources, aiming to enhance their performance further (Wang et al., 2023b).
On this basis, we introduce the concept of AutoKG: autonomous KG construction and reasoning via multi-agent communication. In this framework, the human role is diminished, with multiple communicative agents each playing their respective roles. These agents interact with external sources, collaboratively accomplishing the task. We summarize our contributions as follows:
>> We evaluate LLMs, including ChatGPT and GPT-4, offering an initial understanding of their capabilities by evaluating their zero-shot and one-shot performance on KG construction and reasoning on eight benchmark datasets.
>> We design a novel Virtual Knowledge Extraction task and construct the VINE dataset. By evaluating the performance of LLMs on it, we further demonstrate that LLMs such as GPT-4 possess strong generalization abilities.
>> We introduce the concept of automatic KG construction and reasoning, known as AutoKG. Leveraging LLMs' inner knowledge, we enable multiple agents of LLMs to assist in the process through iterative dialogues, providing insights for future research.
The release of large language models like GPT-4, recognized for their remarkable general capabilities, has been considered by researchers as the spark of artificial general intelligence (AGI) (Bubeck et al., 2023). To facilitate an in-depth understanding of their performance in KG-related tasks, a series of evaluations are conducted. §2.1 introduces the evaluation principles, followed by a detailed analysis in §2.2 on the performance of LLMs in the construction and reasoning tasks, highlighting variations across different datasets and domains. Moreover, §2.3 delves into the reasons underlying the subpar performance of LLMs in certain tasks. And finally, §2.4 discusses whether the models' performance is genuinely indicative of generalization abilities or influenced by inherent advantages of the knowledge base.
In contemplating the trajectory of Knowledge Graph, the pronounced merits of large language models become evident. They not only optimize resource utilization but also outperform smaller models in adaptability, especially in varied application domains and data-limited settings. Such strengths position them as primary tools for KG construction and reasoning. Yet, while the prowess of LLMs is impressive, researchers have identified certain limitations, such as misalignment with human preferences and the tendency for hallucinations. The efficacy of models like ChatGPT heavily leans on human engagement in dialogue generation. Further refining model responses necessitates intricate user task descriptions and enriched interaction contexts, a process that remains demanding and time-intensive in the development lifecycle.
Consequently, there is a growing interest in the realm of interactive natural language processing (iNLP) (Wang et al., 2023b). In parallel, research efforts concerning intelligent agents continue to proliferate (Wang et al., 2023a; Xi et al., 2023; Zhao et al., 2023a). A notable example of this advancement is AutoGPT, which can independently generate prompts and carry out tasks such as event analysis, programming, and mathematical operations. Concurrently, Li et al. (2023) delves into the potential for autonomous cooperation between communicative agents and introduces a novel cooperative agent framework called role-playing. In light of our findings, we propose the use of communicative intelligent agents for KG construction, leveraging different roles assigned to multiple agents to collaborate on KG tasks based on their mutual knowledge. Considering the knowledge cutoff prevalent in large models during the pre-training phase, we suggest the incorporation of external sources to assist task completion. These sources can include knowledge bases, existing KGs, and internet retrieval systems, among others. Here we name this AutoKG.
For a simple demonstration of the concept, we utilize the role-playing method in CAMEL (Li et al., 2023). As depicted in Figure 6, we designate the KG assistant agent as a Consultant and the KG user agent as a KG domain expert. Upon receipt of the prompt and assigned roles, the task-specifier agent provides an elaborate description to clarify the concept. Following this, the KG assistant and KG user collaborate in a multi-party setting to complete the specified task until the KG user confirms its completion. Concurrently, a web searcher role is introduced to aid the KG assistant in internet knowledge retrieval. When the KG assistant receives a dialogue from the KG user, it initially consults the web searcher on whether to browse information online based on the content. Guided by the web searcher's response, the KG assistant then continues to address the KG user's command. The experimental example indicates that the knowledge graph related to the film Spider-Man: Across the Spider-Verse released in 2023 is more effectively and comprehensively constructed using the multi-agent and internet-augmented approach.
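The interaction loop described above can be sketched in a toy form. The agent functions and the keyword-based browsing rule below are our simplifications for illustration; the paper's setup builds on CAMEL's role-playing, where each role is backed by a real LLM:

```python
def web_searcher(message):
    """Stub: decide whether the assistant should browse. A real agent
    would be an LLM judging if the request needs fresh web knowledge."""
    needs_web = "2023" in message or "latest" in message
    return "search" if needs_web else "skip"


def kg_assistant(user_message, retrieved=None):
    """Stub KG assistant: answer the KG user, optionally grounding the
    reply in retrieved web results."""
    source = "web" if retrieved else "internal"
    return f"[assistant/{source}] triples for: {user_message}"


def autokg_turn(user_message):
    """One turn of the loop: the assistant first consults the web
    searcher, then addresses the KG user's command."""
    decision = web_searcher(user_message)
    retrieved = "mock search results" if decision == "search" else None
    return kg_assistant(user_message, retrieved)


reply = autokg_turn("Build a KG for the film released in 2023.")
```

A query about a post-cutoff film triggers retrieval, while a generic request is answered from the model's internal knowledge, mirroring the division of labor between the KG assistant and the web searcher.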
Remark. By combining the efforts of artificial intelligence and human expertise, AutoKG could speed up the creation of specialized KGs, fostering a collaborative environment with language models. This system leverages domain and internet knowledge to produce high-quality KGs, augmenting the factual accuracy of LLMs in domain-specific tasks, thereby increasing their practical utility. AutoKG not only simplifies the construction process but also improves LLMs' transparency, facilitating a deeper understanding of their internal workings. As a cooperative human-machine platform, it bolsters the understanding and guidance of LLMs' decision-making, increasing their efficiency in complex tasks. However, it is noteworthy that despite the assistance of AutoKG, the current results of the constructed knowledge graph still necessitate manual evaluation and validation.
Furthermore, three significant challenges remain when utilizing AutoKG, necessitating further research and resolution:
>> The utilization of the API is constrained by a maximum token limit. Currently, the gpt-3.5-turbo in use is subject to a max-token restriction, which impacts the construction of KGs.
>> AutoKG currently exhibits shortcomings in facilitating efficient human-machine interaction. In fully autonomous machine operations, human oversight for immediate error correction is lacking, yet incorporating human involvement in every step would increase time and labor costs substantially.
>> The hallucination problem of LLMs. Given the known propensity of LLMs to generate non-factual information, it is imperative to scrutinize their outputs. This can be achieved via comparison with standard answers, expert review, or semi-automatic algorithms.
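One common mitigation for the max-token constraint noted above, sketched here under our own assumptions rather than as the paper's implementation, is to split the input corpus into chunks that fit the context window, extract from each chunk separately, and merge the resulting triples. The word-based chunk size is a stand-in for a real tokenizer-based budget:

```python
def chunk_text(words, max_words=100):
    """Split a word list into consecutive chunks of at most max_words,
    so each chunk stays under the model's context limit."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


# Illustrative corpus of 250 "words": yields chunks of 100, 100, and 50.
corpus = ["token"] * 250
chunks = chunk_text(corpus, max_words=100)
```

Each chunk would then be sent through the extraction prompt independently, with the per-chunk triples unioned into one KG; overlap between chunks can be added to avoid cutting facts at boundaries.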
In this paper, we investigate LLMs for KG construction and reasoning. We question whether LLMs' extraction abilities arise from their vast pre-training corpus or their strong contextual learning capabilities. To investigate this, we conduct a Virtual Knowledge Extraction task using a novel dataset, with results highlighting the LLMs' robust contextual learning. Furthermore, we propose an innovative method of AutoKG for accomplishing KG construction and reasoning tasks by employing multiple agents. In the future, we would like to extend our work to other LLMs and explore additional KG-related tasks, such as multimodal reasoning.
While our research has yielded some results, it also possesses certain limitations. As previously stated, the inability to access the GPT-4 API has necessitated our reliance on an interactive interface for conducting experiments, undeniably inflating workload and time costs. We look forward to future research opportunities that will allow us to further explore these areas.
LLMs. We confine our experiments to models within the GPT series, leaving the performance of other large models like LaMDA (Thoppilan et al., 2022) unexamined. Future work could extend these experiments to more LLMs. Additionally, we do not have access to the GPT-4 API; thus, we complete our experiments via an interactive interface, which is both time-consuming and labor-intensive.
Tasks. Not all KG construction and reasoning tasks are considered in our study. We focus on a handful of representative tasks, which might limit the applicability of our findings in specific contexts. Also, due to the unavailability of GPT-4's multimodal capabilities to the public, we are unable to delve into its performance and contribution to multimodal processing. We look forward to future research opportunities that would allow us to explore these areas further.
Large language models used in our experiments may have inherent biases and issues related to factual accuracy. Thus, the experimental results should be interpreted with a critical mindset.