用合成數據評測 RAG 系統：一份可直接上手的 DeepEval 實操指南原創

Halo咯咯

發布于 2025-10-17 08:38

瀏覽

0收藏

在構建 RAG（Retrieval-Augmented Generation，檢索增強生成）系統的過程中，很多人都有這樣的困惑：

“模型看起來能回答問題，但到底是不是在胡說八道？” “Retriever 到底找得準不準？” “我該怎么知道系統整體是不是可靠的？”

這些問題的根源在于——我們缺乏系統化的評測方法。尤其在項目早期，還沒有真實用戶數據時，想要驗證 RAG 流程的效果就更加困難。

今天，我們就來深入拆解一個實用方案： ?? 用 DeepEval 生成合成數據，系統性評測你的 RAG Pipeline。

這篇文章會帶你一步步上手，包括依賴安裝、數據生成、復雜度控制、評測邏輯等全部環節。讀完后，你不僅能快速搭建一個自動化評測體系，還能理解為什么「合成數據」是 RAG 測試的關鍵突破口。

一、為什么要用合成數據評測 RAG？

在真實業務場景中，我們希望 RAG 系統具備三個核心能力：

檢索準確（Retriever）：能找到與問題最相關的文檔；
生成可靠（LLM）：答案必須“有出處”，不能胡編；
上下文合適（Context）：輸入長度、內容密度要恰到好處。

但在系統上線前，我們往往沒有足夠的真實問題和反饋樣本。這就導致很難知道模型的回答是否“扎實落地”。

而 合成數據（Synthetic Data） 正好填補了這個空白。

通過自動生成模擬用戶問題 + 理想回答（golden pairs），我們能提前建立一個可重復測試集：

不依賴真實用戶；
能針對不同類型問題系統化覆蓋；
能反復驗證 Retriever 和 Generator 的優化效果。

DeepEval 就是這個過程的核心工具。

二、DeepEval：專為 LLM 評測設計的開源框架

DeepEval 是一個專門用于大模型評測的開源框架，支持包括 RAG 流水線在內的各種場景。它的優勢主要體現在三點：

?自動生成合成測試數據：內置??Synthesizer?? 類，可基于文檔生成真實感極強的 QA 對；
?多維度評測指標：從 Grounding（答案是否有出處）、Context Relevance（上下文相關性）到 Faithfulness（事實一致性）；
?可擴展配置：通過??EvolutionConfig?? 控制生成樣本的復雜度與類型。

接下來我們進入實操環節。

三、安裝依賴與準備環境

首先，安裝所需依賴庫。

pip install deepeval chromadb tiktoken pandas

安裝完成后，配置你的 OpenAI API Key。 DeepEval 會調用外部模型（如 GPT-4）來生成和評測數據。

前往 OpenAI API 管理頁，新建 API Key 并填入你的環境變量中：

export OPENAI_API_KEY="sk-xxxxxxx"

?? 提示：初次使用 OpenAI API 可能需要綁定支付方式并充值約 $5 才能啟用。

四、準備源文本：生成“合成問答”的素材

接下來，我們需要準備一份源文本，它將作為合成數據的“語料庫”。這份文本應盡量內容多樣、語義清晰、事實準確。

例如：

text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
...
"""

將其保存為一個文本文件：

with open("example.txt", "w") as f:
    f.write(text)

?? 技巧：你完全可以換成自己的內容，比如項目知識庫、技術文檔、內部 FAQ 等，這樣生成的評測樣本就更貼近業務實際。

五、自動生成合成數據（Synthetic Goldens）

DeepEval 的核心類 ??Synthesizer?? 可以直接讀取文檔并生成高質量的 QA 對。

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4.1-nano")

# 從文檔中生成合成數據
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# 打印部分結果
for golden in synthesizer.synthetic_goldens[:3]:  
    print(golden, "\n")

運行結果示例：

Input: Evaluate the cognitive abilities of corvids in facial recognition tasks.
Expected Output: Crows can recognize human faces and remember them for years, showing advanced memory and problem-solving.
Context: "Crows are among the smartest birds..."

可以看到，每個樣本都包含：

用戶問題（input）
理想回答（expected output）
語料來源（context）

這些就是我們的 golden pairs —— 可用于后續的模型性能驗證。

六、控制樣本復雜度：EvolutionConfig 的威力

光生成 QA 對還不夠，我們需要控制生成問題的復雜度與多樣性，讓測試更貼近真實用戶提問。

DeepEval 提供了 ??EvolutionConfig??，可以通過「進化策略」來調節生成方式。

from deepeval.synthesizer.config import EvolutionConfig, Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])

這樣一來，生成的樣本不僅僅是簡單問答，而會覆蓋：

推理類問題（Reasoning）
多上下文問題（MultiContext）
對比類問題（Comparative）
假設場景（Hypothetical）
廣域探索問題（InBreadth）

例如：

Q：比較 Voyager 1 的黃金唱片與亞歷山大圖書館在人類歷史中的意義。A：兩者都承載了人類知識與文明的象征，前者跨越宇宙，后者見證文明的起點。

這樣的數據能全面測試模型的多層推理與信息整合能力。

七、構建迭代評測循環：RAG 改進閉環

當我們有了高質量的合成數據，就可以進入核心環節——RAG 評測閉環。

典型的流程如下：

Retriever 測試：驗證召回文檔的相關性；
LLM 評測：檢查生成回答是否基于上下文；
指標計算：如 Grounding、Context Relevance、Faithfulness；
結果反饋與優化：調整檢索策略或 Prompt；
重新評測：觀察指標是否提升。

這就是一個完整的 Iterative RAG Improvement Loop（迭代改進循環）。

它的關鍵在于：

你不需要等待真實用戶來“踩坑”，合成數據已經能讓你提前發現系統的薄弱點。

當 Retriever 的召回率提升、LLM 的事實一致性增強后，你的系統上線風險就會顯著降低。

實戰代碼見最后！

八、實戰建議與擴展思路

如果你準備在真實項目中落地 DeepEval，可以參考以下建議：

??語料選取：優先使用結構化或知識密集型文檔，如產品手冊、內部FAQ；
??模型配置：評測階段可用輕量模型（如 gpt-4.1-nano），正式驗證時切換至完整模型；
??結果分析：結合 ChromaDB 等向量庫，計算各指標變化；
??自動化集成：將評測腳本嵌入 CI/CD 流程，每次更新 Retriever 或 Prompt 后自動驗證。

長期來看，這種方式能讓你的 RAG 系統從「主觀感受好像行」變為「數據指標確實強」。

九、總結：讓 RAG 評測不再是黑箱

RAG 評測的難點在于——系統表現常常“看起來對”，但卻難以驗證背后的可靠性。 DeepEval 的出現，讓這件事變得可量化、可復現、可持續改進。

合成數據的價值不在于替代真實用戶，而在于提前建立可控的測試環境。通過 EvolutionConfig 等機制，我們甚至能模擬用戶提出各種復雜問題，全面檢驗系統的推理與檢索邊界。

一句話總結：

在沒有用戶數據的階段，合成數據就是最好的評測基線；在持續優化階段，DeepEval 就是你的自動化教練。

付實戰代碼：

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
rag_iterative_eval_full.py
完整示例：迭代評測循環（RAG 改進閉環）
功能：
  - 生成/讀取文檔
  - 生成合成 goldens（DeepEval / OpenAI / 規則化）
  - 構建檢索器（OpenAI embeddings 或 TF-IDF）
  - 使用檢索到的上下文調用 LLM 生成答案（OpenAI 或簡單拼接回復）
  - 計算 grounding / context_relevance / faithfulness 指標
  - 基于指標自動調整 top_k 與 temperature（形成閉環）
  - 保存與打印每輪結果
作者：jilolo
日期：2025-10
"""

import os
import json
import time
import math
import random
import hashlib
from typing import List, Dict, Any, Tuple
from collections import defaultdict, Counter

# optional imports
try:
    import openai
except Exception:
    openai = None

try:
    import numpy as np
    from numpy.linalg import norm
    NUMPY_AVAILABLE = True
except Exception:
    NUMPY_AVAILABLE = False

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    SKLEARN_AVAILABLE = True
except Exception:
    SKLEARN_AVAILABLE = False

try:
    from tqdm import tqdm
    TQDM_AVAILABLE = True
except Exception:
    TQDM_AVAILABLE = False

# -------------------------
# CONFIG
# -------------------------
CONFIG = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
    "OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
    "OPENAI_COMPLETION_MODEL": "gpt-4o-mini",  # change to available model
    "DOC_PATH": "example.txt",
    "NUM_GOLDENS": 12,
    "ITERATIONS": 6,
    "INITIAL_TOP_K": 3,
    "MAX_TOP_K": 8,
    "MIN_TOP_K": 1,
    "TEMPERATURE_OPTIONS": [0.0, 0.2, 0.5],
    "SEED": 42,
    "REPORT_FILE": "rag_eval_report.json",
    "SAVE_DIR": "rag_eval_runs",
    "PROMPT_TEMPLATE": (
        "You are a knowledgeable assistant. Use only the provided context snippets to answer the question. "
        "If the information is not present in the context, respond with 'Insufficient information in context.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
    # metric thresholds for increasing/decreasing top_k
    "GROUNDING_GOOD": 0.7,
    "GROUNDING_BAD": 0.45,
    "FAITHFULNESS_GOOD": 0.7,
    "FAITHFULNESS_BAD": 0.45,
    "CONTEXT_RELEVANCE_GOOD": 0.7,
    "CONTEXT_RELEVANCE_BAD": 0.45,
}

random.seed(CONFIG["SEED"])
if openai and CONFIG["OPENAI_API_KEY"]:
    openai.api_key = CONFIG["OPENAI_API_KEY"]

# -------------------------
# Utilities
# -------------------------
def safe_print(*args, **kwargs):
    print(*args, **kwargs)

def ensure_dir(path: str):
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

def sha1_snippet(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:10]

# -------------------------
# Example document (will write if missing)
# -------------------------
SAMPLE_TEXT = """Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
The archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Superconductors can carry electric current with zero resistance -- a phenomenon discovered over a century ago but still unlocking new technologies like quantum computers today.
The Library of Alexandria was once the largest center of learning, but much of its collection was lost in fires and wars.
Voyager 1 probe, launched in 1977, has left the solar system, carrying a golden record with sounds and images of Earth.
The Amazon rainforest produces roughly 20% of the world's oxygen.
Coral reefs support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
MRI scanners use strong magnetic fields and radio waves to generate detailed images of organs without harmful radiation.
Moore's Law observed that the number of transistors on microchips doubles roughly every two years.
The Mariana Trench is the deepest part of Earth's oceans, reaching nearly 11,000 meters below sea level.
Ancient civilizations like the Sumerians and Egyptians invented mathematical systems thousands of years ago.
"""

def ensure_example_doc(path: str):
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE_TEXT)
        safe_print(f"[INFO] Wrote sample doc to {path}")

# -------------------------
# Synthetic golden generation (fallback-first approach)
# -------------------------
def simple_rule_based_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    """
    Very simple fallback: split document into sentences/paragraphs and craft simple Q/A.
    """
    with open(doc_path, "r", encoding="utf-8") as f:
        txt = f.read()
    paras = [p.strip() for p in txt.split("\n") if p.strip()]
    goldens = []
    for p in paras:
        q = f"What is one key fact from the following sentence: '{p[:120]}...'? "
        a = p
        goldens.append({"input": q, "expected_output": a, "context": p})
        if len(goldens) >= num:
            break
    return goldens

def openai_synthesize_goldens(doc_path: str, num: int = 12, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> List[Dict[str, str]]:
    """
    Try to use OpenAI to synthesize question-answer pairs.
    If OpenAI is not configured or API call fails, fall back to rule-based generation.
    """
    if openai is None or not getattr(openai, "api_key", None):
        safe_print("[WARN] OpenAI key not found - using rule-based goldens")
        return simple_rule_based_goldens(doc_path, num)
    with open(doc_path, "r", encoding="utf-8") as f:
        doc = f.read()

    prompt = (
        f"You are a dataset creator. Given the document below, produce {num} question-answer pairs. "
        f"For each pair, provide 'question', 'answer' (concise and grounded in the doc), and 'context' (the snippet). "
        f"Return a JSON array of objects.\n\nDocument:\n{doc}\n\n"
    )

    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You generate QA pairs."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=1500
        )
        text = resp["choices"][0]["message"]["content"]
        # find JSON in text
        start = text.find("[")
        if start >= 0:
            json_text = text[start:]
            try:
                arr = json.loads(json_text)
                goldens = []
                for item in arr[:num]:
                    q = item.get("question") or item.get("input") or item.get("q") or ""
                    a = item.get("answer") or item.get("expected_output") or ""
                    c = item.get("context") or ""
                    goldens.append({"input": q.strip(), "expected_output": a.strip(), "context": c.strip()})
                safe_print(f"[INFO] OpenAI synthesized {len(goldens)} goldens.")
                return goldens
            except Exception as e:
                safe_print("[WARN] Failed to parse JSON from OpenAI output:", e)
                return simple_rule_based_goldens(doc_path, num)
        else:
            safe_print("[WARN] OpenAI response lacking JSON - using rule-based fallback.")
            return simple_rule_based_goldens(doc_path, num)
    except Exception as e:
        safe_print("[ERROR] OpenAI call failed:", e)
        return simple_rule_based_goldens(doc_path, num)

def generate_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    # Attempt DeepEval if installed (not required here); else OpenAI; else rule-based
    # To keep dependencies light in this script we skip DeepEval auto-call.
    return openai_synthesize_goldens(doc_path, num)

# -------------------------
# Retriever: TF-IDF (fallback) and Embedding based (OpenAI)
# -------------------------
class TFIDFRetriever:
    def __init__(self, docs: List[str]):
        if not SKLEARN_AVAILABLE:
            raise RuntimeError("sklearn not available for TF-IDF retriever.")
        self.docs = docs
        self.vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
        self.doc_matrix = self.vectorizer.fit_transform(self.docs)

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.doc_matrix)[0]
        idx_scores = list(enumerate(sims))
        idx_scores.sort(key=lambda x: x[1], reverse=True)
        return idx_scores[:top_k]

class OpenAIEmbeddingRetriever:
    def __init__(self, docs: List[str], embedding_model: str = CONFIG["OPENAI_EMBEDDING_MODEL"]):
        self.docs = docs
        self.embedding_model = embedding_model
        self.embeddings = []
        # compute embeddings
        self._build()

    def _embed_text(self, text: str):
        if openai is None or not getattr(openai, "api_key", None):
            # fallback: random vector (deterministic via hash)
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()  # fake dim
            else:
                return [random.random() for _ in range(512)]
        try:
            resp = openai.Embedding.create(model=self.embedding_model, input=text)
            return resp["data"][0]["embedding"]
        except Exception as e:
            safe_print("[WARN] OpenAI embedding failed:", e)
            # fallback deterministic pseudo-random
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()
            else:
                return [random.random() for _ in range(512)]

    def _build(self):
        self.embeddings = [self._embed_text(d) for d in self.docs]

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        q_emb = self._embed_text(query)
        # compute cosine similarities
        if NUMPY_AVAILABLE:
            qv = np.array(q_emb, dtype=float)
            sims = []
            for emb in self.embeddings:
                ev = np.array(emb, dtype=float)
                denom = (norm(qv) * norm(ev))
                sim = float(np.dot(qv, ev) / denom) if denom > 0 else 0.0
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]
        else:
            sims = []
            for emb in self.embeddings:
                sim = sum(a*b for a,b in zip(q_emb, emb)) / (len(q_emb) or 1)
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]

# helper hashing for fallback embeddings
def hashlib_sha1_int(s: str) -> int:
    return int(hashlib.sha1(s.encode('utf-8')).hexdigest()[:16], 16)

# -------------------------
# Generator (LLM call) with fallback
# -------------------------
def call_openai_chat(question: str, contexts: List[str], temperature: float = 0.0, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> str:
    if openai is None or not getattr(openai, "api_key", None):
        # fallback: naive rule - if any context contains a sentence with overlap words, return that sentence; else "Insufficient"
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return"Insufficient information in context."
    # try call
    prompt = CONFIG["PROMPT_TEMPLATE"].format(context="\n\n".join(contexts), question=question)
    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a precise assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=512,
        )
        text = resp["choices"][0]["message"]["content"].strip()
        return text
    except Exception as e:
        safe_print("[WARN] OpenAI ChatCompletion failed:", e)
        # fallback naive
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return"Insufficient information in context."

# -------------------------
# Metrics implementations
# -------------------------
def compute_context_relevance(retrieved_idxs_scores: List[Tuple[int, float]]) -> float:
    """
    Simple metric: average similarity score (score between 0-1)
    """
    if not retrieved_idxs_scores:
        return 0.0
    scores = [s for _, s in retrieved_idxs_scores]
    # ensure in [0,1]
    clipped = [max(0.0, min(1.0, float(x))) for x in scores]
    return sum(clipped) / len(clipped)

def compute_grounding(answer: str, contexts: List[str]) -> float:
    """
    Heuristic: fraction of answer tokens that have overlap with context tokens.
    Returns 0-1.
    """
    a_words = [w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w) > 2]
    if not a_words:
        return 0.0
    context_text = " ".join(contexts).lower()
    hits = sum(1 for w in a_words if w in context_text)
    return hits / len(a_words)

def compute_faithfulness(answer: str, expected: str) -> float:
    """
    Very simple normalized similarity:
    - overlap ratio of important tokens (set intersection over union)
    """
    a_set = set([w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w)>2])
    e_set = set([w.strip(" ,.;:()[]'\"").lower() for w in expected.split() if len(w)>2])
    if not a_set and not e_set:
        return 1.0
    if not a_set or not e_set:
        return 0.0
    inter = a_set & e_set
    union = a_set | e_set
    return len(inter) / len(union)

# -------------------------
# Single-run RAG evaluation on list of goldens
# -------------------------
def run_rag_eval(
    goldens: List[Dict[str, str]],
    docs: List[str],
    retriever,
    top_k: int,
    temperature: float
) -> Dict[str, Any]:
    """
    Run through goldens, for each:
      - retrieve top_k contexts
      - call generator
      - compute metrics
    Return aggregated metrics and per-sample results
    """
    per_samples = []
    total_grounding = 0.0
    total_context_rel = 0.0
    total_faith = 0.0

    iterator = goldens if not TQDM_AVAILABLE else tqdm(goldens, desc=f"Eval top_k={top_k}, temp={temperature}")

    for g in iterator:
        q = g["input"]
        expected = g.get("expected_output", "")
        # retrieve
        retrieved = retriever.retrieve(q, top_k=top_k)
        contexts = [docs[idx] for idx, _ in retrieved]
        ctx_scores = [score for _, score in retrieved]

        # call generator
        answer = call_openai_chat(q, contexts, temperature=temperature)

        # compute metrics
        context_rel = compute_context_relevance(retrieved)
        grounding = compute_grounding(answer, contexts)
        faith = compute_faithfulness(answer, expected)

        total_context_rel += context_rel
        total_grounding += grounding
        total_faith += faith

        per_samples.append({
            "question": q,
            "expected": expected,
            "answer": answer,
            "retrieved": [{"idx": idx, "score": float(score), "snippet_hash": sha1_snippet(docs[idx])} for idx, score in retrieved],
            "metrics": {"context_relevance": context_rel, "grounding": grounding, "faithfulness": faith}
        })

    n = len(goldens)
    agg = {
        "avg_context_relevance": total_context_rel / n if n else 0.0,
        "avg_grounding": total_grounding / n if n else 0.0,
        "avg_faithfulness": total_faith / n if n else 0.0
    }
    return {"aggregate": agg, "samples": per_samples}

# -------------------------
# Iterative parameter adjustment logic
# -------------------------
def adjust_params(current_top_k: int, metrics: Dict[str, float]) -> int:
    """
    Very simple policy:
      - If grounding low -> increase top_k (more context)
      - If grounding high and context relevance low -> increase top_k
      - If grounding high & context relevance high -> try reduce top_k to optimize
    Bound by min/max.
    """
    g = metrics.get("avg_grounding", 0.0)
    cr = metrics.get("avg_context_relevance", 0.0)
    fa = metrics.get("avg_faithfulness", 0.0)
    new_top_k = current_top_k

    # if grounding is very low, expand context
    if g < CONFIG["GROUNDING_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 2)
    elif cr < CONFIG["CONTEXT_RELEVANCE_BAD"] and g < CONFIG["GROUNDING_GOOD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 1)
    elif g > CONFIG["GROUNDING_GOOD"] and cr > CONFIG["CONTEXT_RELEVANCE_GOOD"]:
        # try shrink to save cost
        new_top_k = max(CONFIG["MIN_TOP_K"], current_top_k - 1)
    # small adjustments if faithfulness very low
    if fa < CONFIG["FAITHFULNESS_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], new_top_k + 1)
    # ensure bounds
    new_top_k = max(CONFIG["MIN_TOP_K"], min(CONFIG["MAX_TOP_K"], new_top_k))
    return new_top_k

def pick_temperature(candidate_list: List[float], metrics: Dict[str, float]) -> float:
    """
    Simple heuristic: if faithfulness low, use lower temp (more deterministic).
    If faithfulness high and grounding high, allow slightly higher temp for diversity.
    """
    fa = metrics.get("avg_faithfulness", 0.0)
    g = metrics.get("avg_grounding", 0.0)
    if fa < 0.4 or g < 0.4:
        return min(candidate_list)
    if fa > 0.75 and g > 0.7:
        return max(candidate_list)
    return candidate_list[len(candidate_list)//2]

# -------------------------
# Main pipeline
# -------------------------
def main():
    safe_print("=== RAG Iterative Evaluation Demo ===")
    ensure_example_doc(CONFIG["DOC_PATH"])
    ensure_dir(CONFIG["SAVE_DIR"])

    # load docs and split into chunks (naive paragraph chunking)
    with open(CONFIG["DOC_PATH"], "r", encoding="utf-8") as f:
        doc_text = f.read()
    paragraphs = [p.strip() for p in doc_text.split("\n") if p.strip()]
    # if paragraphs too short, split sentences
    if len(paragraphs) < 5:
        # attempt sentence split
        sents = [s.strip() for s in doc_text.replace("\n", " ").split(".") if s.strip()]
        # group per 1-2 sentences
        paragraphs = []
        i = 0
        while i < len(sents):
            chunk = sents[i]
            if i+1 < len(sents):
                if random.random() < 0.5:
                    chunk = chunk + ". " + sents[i+1]
                    i += 2
                else:
                    i += 1
            else:
                i += 1
            paragraphs.append(chunk + ".")
    docs = paragraphs

    safe_print(f"[INFO] Loaded {len(docs)} document chunks for retrieval.")

    # generate goldens
    goldens = generate_goldens(CONFIG["DOC_PATH"], CONFIG["NUM_GOLDENS"])
    safe_print(f"[INFO] Generated {len(goldens)} goldens for evaluation.")

    # choose retriever: prefer OpenAI embeddings if available, else TF-IDF
    retriever = None
    use_embedding = False
    if openai and getattr(openai, "api_key", None) and NUMPY_AVAILABLE:
        try:
            retriever = OpenAIEmbeddingRetriever(docs)
            use_embedding = True
            safe_print("[INFO] Using OpenAI embedding retriever.")
        except Exception as e:
            safe_print("[WARN] OpenAIEmbeddingRetriever failed, falling back to TF-IDF:", e)
    if retriever is None:
        if SKLEARN_AVAILABLE:
            retriever = TFIDFRetriever(docs)
            safe_print("[INFO] Using TF-IDF retriever.")
        else:
            # fallback: naive substring search retriever
            class NaiveRetriever:
                def __init__(self, docs):
                    self.docs = docs
                def retrieve(self, query, top_k=3):
                    qs = query.lower()
                    scores = []
                    for i, d in enumerate(self.docs):
                        s = sum(1 for w inset(qs.split()) if w in d.lower())
                        scores.append((i, float(s)))
                    scores.sort(key=lambda x: x[1], reverse=True)
                    return scores[:top_k]
            retriever = NaiveRetriever(docs)
            safe_print("[INFO] Using naive substring retriever.")

    # iterative loop
    cur_top_k = CONFIG["INITIAL_TOP_K"]
    cur_temp = CONFIG["TEMPERATURE_OPTIONS"][0]
    history = []
    for itr in range(1, CONFIG["ITERATIONS"] + 1):
        safe_print(f"\n--- Iteration {itr} | top_k={cur_top_k} | temp={cur_temp} ---")
        result = run_rag_eval(goldens, docs, retriever, top_k=cur_top_k, temperature=cur_temp)
        agg = result["aggregate"]
        safe_print(f"[RESULT] avg_context_relevance={agg['avg_context_relevance']:.3f}, avg_grounding={agg['avg_grounding']:.3f}, avg_faithfulness={agg['avg_faithfulness']:.3f}")
        # save per-iteration
        run_record = {
            "iteration": itr,
            "top_k": cur_top_k,
            "temperature": cur_temp,
            "aggregate": agg,
            "timestamp": time.time(),
            "samples_count": len(result["samples"])
        }
        history.append(run_record)
        # adapt params
        new_top_k = adjust_params(cur_top_k, agg)
        new_temp = pick_temperature(CONFIG["TEMPERATURE_OPTIONS"], agg)
        safe_print(f"[ADAPT] next_top_k={new_top_k}, next_temp={new_temp}")
        # if no change and already good metrics, we can stop early
        if new_top_k == cur_top_k and new_temp == cur_temp and agg["avg_grounding"] > 0.8 and agg["avg_faithfulness"] > 0.8:
            safe_print("[INFO] Metrics are good and stable - stopping early.")
            break
        cur_top_k = new_top_k
        cur_temp = new_temp

    # produce final report
    report = {
        "config": CONFIG,
        "docs_count": len(docs),
        "goldens_count": len(goldens),
        "history": history
    }
    report_path = os.path.join(CONFIG["SAVE_DIR"], CONFIG["REPORT_FILE"])
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
    safe_print(f"\n[FINISH] Saved report to {report_path}")
    safe_print("=== End ===")

if __name__ == "__main__":
    main()

本文轉載自??Halo咯咯?? 作者：基咯咯

?著作權歸作者所有，如需轉載，請注明出處，否則將追究法律責任

標簽

RAG

DeepEval

檢索增強生成

已于2025-10-17 08:38:05修改

贊

回復

舉報

回復

51CTO

51CTO博客

51CTO學堂

用合成數據評測 RAG 系統：一份可直接上手的 DeepEval 實操指南原創

一、為什么要用合成數據評測 RAG？

二、DeepEval：專為 LLM 評測設計的開源框架

三、安裝依賴與準備環境

四、準備源文本：生成“合成問答”的素材

五、自動生成合成數據（Synthetic Goldens）

六、控制樣本復雜度：EvolutionConfig 的威力

七、構建迭代評測循環：RAG 改進閉環

八、實戰建議與擴展思路

九、總結：讓 RAG 評測不再是黑箱

目錄

51CTO

51CTO博客

51CTO學堂

用合成數據評測 RAG 系統：一份可直接上手的 DeepEval 實操指南 原創

一、為什么要用合成數據評測 RAG？

二、DeepEval：專為 LLM 評測設計的開源框架

三、安裝依賴與準備環境

四、準備源文本：生成“合成問答”的素材

五、自動生成合成數據（Synthetic Goldens）

六、控制樣本復雜度：EvolutionConfig 的威力

七、構建迭代評測循環：RAG 改進閉環

八、實戰建議與擴展思路

九、總結：讓 RAG 評測不再是黑箱

目錄

用合成數據評測 RAG 系統：一份可直接上手的 DeepEval 實操指南原創