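Introduction

Large language models (LLMs) perform strongly across applications such as question answering, translation, and summarization, and their capabilities keep growing as the field advances. However, LLMs sometimes generate factually incorrect answers, especially when the training data contains nothing that corresponds to the input. This phenomenon is called "hallucination".

To mitigate hallucination, researchers proposed retrieval-augmented generation (RAG). The method retrieves data from a knowledge base to help the model produce more reliable answers. Even so, RAG can still hallucinate, so detecting hallucinations in a RAG system and handling them appropriately is particularly important.

In modern LLM systems, the trustworthiness of the output is a key metric, which makes hallucination detection and handling more important than ever.

The basic RAG flow is: fetch information from a knowledge base through sparse or dense retrieval, pass the most relevant results to the LLM together with the user input, and generate the final answer. The output can still contain hallucinations, for several reasons: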

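  • The LLM receives the correct information but fails to produce a correct answer, which is common when the answer requires complex reasoning over the retrieved information.
  • The retrieved results are wrong or miss key information, in which case the LLM may force an answer and hallucinate.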
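The basic RAG flow described above can be sketched in a few lines. This is a toy illustration, not a production retriever: the function names `retrieve` and `build_prompt` are hypothetical, and plain keyword overlap stands in for a real sparse or dense retriever.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Rank passages by keyword overlap with the query (a stand-in for real retrieval)."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(query_terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, passages):
    """Combine the retrieved passages and the user input into one prompt for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


knowledge_base = [
    "The Great Wall of China is a series of fortifications.",
    "Apollo 11 landed on the moon on July 20, 1969.",
]
question = "When did Apollo 11 land on the moon?"
print(build_prompt(question, retrieve(question, knowledge_base, top_k=1)))
```

Both failure modes listed above map onto this sketch: `retrieve` can return the wrong passage, and the LLM can misread a correct prompt.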

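This article focuses on checking the responses a RAG system generates rather than on improving the retrieval step. We explore several hallucination detection techniques that help build more reliable RAG systems.

Hallucination Metric

We start with the hallucination metric in the DeepEval library. It is a simple, comparison-based check of whether the model's output agrees with the context it was given. It is computed as:

number of contradicted contexts ÷ total number of contexts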
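As a minimal sketch of the formula above, assume the evaluator has already produced one verdict per context saying whether the output contradicts it. The helper name `hallucination_score` and the list-of-booleans input are illustrative assumptions, not DeepEval's internal API.

```python
def hallucination_score(contradicted_flags):
    """contradicted_flags[i] is True when the output contradicts context i."""
    if not contradicted_flags:
        raise ValueError("at least one context is required")
    # True counts as 1, so the sum is the number of contradicted contexts.
    return sum(contradicted_flags) / len(contradicted_flags)


print(hallucination_score([True]))                       # 1.0
print(hallucination_score([True, False, False, False]))  # 0.25
```

A lower score is better: 0 means no context is contradicted, 1 means every context is.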

The DeepEval code below demonstrates the metric in practice.

Installing DeepEval

pip install deepeval

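The evaluation relies on an LLM to judge the results, so we need a model to act as the evaluator. This example uses DeepEval's default OpenAI model; see the DeepEval documentation to switch to another LLM. An OpenAI API key is also required: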

import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

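Setting the Context and Actual Output

With the library installed, we can check whether an LLM output contains hallucinations.

First set the context (the ground-truth facts supplied with the input) and prepare the model's actual output as the object under test: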

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase
 
context = [
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, "
    "generally built along an east-to-west line across the historical northern borders of China to protect the Chinese states "
    "and empires against the raids and invasions of the nomadic groups of the Eurasian Steppe."
]
 
actual_output = (
    "The Great Wall of China is made entirely of gold and was built in a single year by the Ming Dynasty to store treasures."
)

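Setting Up the Test Case and Hallucination Metric

The threshold is the level of hallucination that can be tolerated: the test case passes only when the score does not exceed it.

The example below uses 0.5; for a strict zero-hallucination requirement, set the threshold to 0: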

test_case = LLMTestCase(
    input="What is the Great Wall of China made of and why was it built?",
    actual_output=actual_output,
    context=context
)
 
halu_metric = HallucinationMetric(threshold=0.5)
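The role of the threshold can be sketched as a simple upper bound on the score. The helper `passes_hallucination_check` is hypothetical and only mirrors the pass rule described here (lower scores are better); it is not a DeepEval function.

```python
def passes_hallucination_check(score: float, threshold: float) -> bool:
    """Pass when the hallucination score stays at or below the tolerated level."""
    return score <= threshold


# A tolerant setting accepts a partially contradicted output; a strict one does not.
print(passes_hallucination_check(0.25, threshold=0.5))  # True
print(passes_hallucination_check(0.25, threshold=0.0))  # False
print(passes_hallucination_check(0.0, threshold=0.0))   # True
```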

Running the Test and Viewing the Results

halu_metric.measure(test_case)
print("Hallucination Metric:")
print("  Score: ", halu_metric.score)
print("  Reason: ", halu_metric.reason)

Sample output

Hallucination Metric:
  Score:  1.0
  Reason:  The score is 1.00 because the actual output contains significant contradictions with the context, such as incorrect claims about the materials and purpose of the Great Wall of China, indicating a high level of hallucination.

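The hallucination score is 1.0, meaning the output contradicts every provided context (here, the single context passage), so it is treated as entirely hallucinated.

DeepEval also returns a reason that helps locate the hallucinated content.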

G-Eval

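G-Eval is a framework that uses an LLM with chain-of-thought (CoT) reasoning to evaluate LLM outputs automatically. The core idea is to judge the model output step by step against predefined multi-step criteria. With DeepEval's G-Eval implementation and custom evaluation criteria, we can test the output of a RAG system and identify whether it contains hallucinations.

In G-Eval, you define the metric yourself from the evaluation goal and steps. A sample configuration: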

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
 
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually accurate, logically consistent, and sufficiently detailed based on the expected output.",
    evaluation_steps=[
        "Check if the 'actual output' aligns with the facts in 'expected output' without any contradictions.",
        "Identify whether the 'actual output' introduces new, unsupported facts or logical inconsistencies.",
        "Evaluate whether the 'actual output' omits critical details needed to fully answer the question.",
        "Ensure that the response avoids vague or ambiguous language unless explicitly required by the question."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

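Next, build a test case that simulates a RAG pipeline. It needs the user input, the model-generated output, the expected correct output, and the retrieved context. The Correctness metric above reads only the first three, as listed in evaluation_params; the retrieved context is used by the RAG-specific metrics in the next section: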

from deepeval.test_case import LLMTestCase
 
test_case = LLMTestCase(
    input="When did the Apollo 11 mission land on the moon?",
    actual_output="Apollo 11 landed on the moon on July 21, 1969, marking humanity's first successful moon landing.",
    expected_output="Apollo 11 landed on the moon on July 20, 1969, marking humanity's first successful moon landing.",
    retrieval_context=[
        """The Apollo 11 mission achieved the first successful moon landing on July 20, 1969.
        Astronauts Neil Armstrong and Buzz Aldrin spent 21 hours on the lunar surface, while Michael Collins orbited above in the command module."""
    ]
)

Run the G-Eval test:

correctness_metric.measure(test_case)
 
print("Score:", correctness_metric.score)
print("Reason:", correctness_metric.reason)

Sample output:

Score: 0.7242769207695651
Reason: The actual output provides the correct description but has an incorrect date, contradicting the expected output

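As shown, G-Eval detects the hallucination in the RAG answer (the wrong date) and gives a sensible explanation. The official documentation also explains in detail how the score is computed.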
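One reason a G-Eval score can be a non-round number such as 0.72 is probability-weighted scoring, as proposed in the G-Eval paper: instead of taking the single rating the judge emits, the candidate ratings are averaged by their probabilities. The sketch below is an assumption-laden illustration of that idea. The 0 to 10 rating scale, the `weighted_geval_score` name, and the example probabilities are all hypothetical, and the exact normalization DeepEval applies may differ.

```python
def weighted_geval_score(rating_probs, max_rating=10):
    """Average candidate ratings by probability, then normalize to the 0..1 range.

    rating_probs maps an integer rating to the (assumed) probability
    that the judge model assigns to it.
    """
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    expected = sum(rating * p for rating, p in rating_probs.items()) / total
    return expected / max_rating


# Hypothetical judge: 75% of the probability mass on rating 7, 25% on rating 8.
print(weighted_geval_score({7: 0.75, 8: 0.25}))
```

The weighting gives finer-grained scores than a single integer rating, which makes small quality differences between answers visible.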

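Faithfulness Metric

If you need more quantitative measures, you can use RAG-specific metrics to test the effectiveness of the retrieval process. One of these metrics, Faithfulness, is dedicated to detecting hallucinations.

DeepEval provides five RAG-specific metrics: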

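  • Contextual precision
    Evaluates the precision of the reranker.

  • Contextual recall
    Evaluates whether the embedding model retrieves the relevant information accurately.

  • Contextual relevancy
    Evaluates whether the text chunk size and the top-K setting are appropriate.

  • Answer relevancy
    Evaluates whether the prompt steers the LLM toward a relevant answer.

  • Faithfulness
    Evaluates whether the LLM output is free of hallucinations and does not contradict the retrieved information.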

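Unlike the hallucination checks above, these metrics focus on the retrieval and output quality of the RAG pipeline.

The following test reuses the Apollo 11 test case from the G-Eval section: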

from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)
 
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
 
contextual_precision.measure(test_case)
print("Contextual Precision:")
print("  Score: ", contextual_precision.score)
print("  Reason: ", contextual_precision.reason)
 
contextual_recall.measure(test_case)
print("\nContextual Recall:")
print("  Score: ", contextual_recall.score)
print("  Reason: ", contextual_recall.reason)
 
contextual_relevancy.measure(test_case)
print("\nContextual Relevancy:")
print("  Score: ", contextual_relevancy.score)
print("  Reason: ", contextual_relevancy.reason)
 
answer_relevancy.measure(test_case)
print("\nAnswer Relevancy:")
print("  Score: ", answer_relevancy.score)
print("  Reason: ", answer_relevancy.reason)
 
faithfulness.measure(test_case)
print("\nFaithfulness:")
print("  Score: ", faithfulness.score)
print("  Reason: ", faithfulness.reason)
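The five measure-and-print blocks above repeat the same pattern, so they can be folded into one loop. The helper below is a sketch: `run_metrics` is a hypothetical name, and it assumes only that each metric object exposes `measure(test_case)`, `score`, and `reason`, as the DeepEval metrics used in this article do.

```python
def run_metrics(metrics, test_case):
    """Measure every metric on one test case and collect (score, reason) pairs."""
    results = {}
    for name, metric in metrics.items():
        metric.measure(test_case)
        results[name] = (metric.score, metric.reason)
        print(f"\n{name}:")
        print("  Score: ", metric.score)
        print("  Reason: ", metric.reason)
    return results


# Usage with the objects defined earlier in this article:
# run_metrics(
#     {
#         "Contextual Precision": contextual_precision,
#         "Contextual Recall": contextual_recall,
#         "Contextual Relevancy": contextual_relevancy,
#         "Answer Relevancy": answer_relevancy,
#         "Faithfulness": faithfulness,
#     },
#     test_case,
# )
```

Returning the results as a dictionary also makes it easy to log scores or fail a build when a particular metric drops.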

Sample output:

Contextual Precision:
  Score:  1.0
  Reason:  The score is 1.00 because the node in the retrieval context perfectly matches the input with accurate and relevant information.
 
Contextual Recall:
  Score:  1.0
  Reason:  The score is 1.00 because every detail in the expected output is perfectly supported by the retrieval context.
 
Contextual Relevancy:
  Score:  0.5
  Reason:  The score is 0.50 because while the retrieval context contains the relevant date 'July 20, 1969', other details are less relevant.
 
Answer Relevancy:
  Score:  1.0
  Reason:  The response directly answered the question without irrelevant content.
 
Faithfulness:
  Score:  0.5
  Reason:  The actual output gave July 21, 1969, contradicting the retrieval context which states July 20, 1969.

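The results show that the RAG pipeline does well on most metrics. The Contextual Relevancy score of 0.5 reflects retrieved details that are not needed to answer the question, so it is a retrieval signal rather than a hallucination signal. The Faithfulness score of 0.5 is what directly flags the hallucination: the date in the answer contradicts the retrieved context.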
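DeepEval's documentation describes Faithfulness as the share of claims in the output that the retrieval context supports (truthful claims ÷ total claims). The sketch below illustrates that ratio. The split of the Apollo 11 answer into two claims is a hypothetical example, since the judge model decides which claims it extracts, and `faithfulness_score` is an illustrative helper, not a DeepEval function.

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts[i] is True when claim i is supported by the retrieval context."""
    if not claim_verdicts:
        raise ValueError("at least one claim is required")
    truthful = sum(1 for verdict in claim_verdicts if verdict)
    return truthful / len(claim_verdicts)


# Hypothetical claim extraction for the Apollo 11 answer.
verdicts = {
    "Apollo 11 landed on the moon on July 21, 1969": False,  # context says July 20
    "It was humanity's first successful moon landing": True,
}
print(faithfulness_score(list(verdicts.values())))  # 0.5
```

Under this assumed split, one contradicted claim out of two gives 0.5, which is consistent with the sample output above. Unlike the hallucination score, a higher Faithfulness score is better.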

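Summary

This article introduced several techniques for detecting hallucinations in RAG systems, in three groups:

  1. The hallucination metric in the DeepEval library;
  2. The G-Eval framework, which combines an LLM judge with chain-of-thought (CoT) reasoning;
  3. RAG-specific metrics, including the Faithfulness evaluation.

The example code shows how to implement each method and how to quantify hallucination in LLM output, in particular by comparing it against the retrieved context or the expected output. These methods help detect hallucinations in RAG systems more effectively and so improve their stability and reliability.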
