
Evaluating RAG Systems with Synthetic Data: A Hands-On DeepEval Guide (Original)

Published 2025-10-17 08:38

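When building a RAG (Retrieval-Augmented Generation) system, many people run into the same questions:

"The model seems to answer questions, but is it actually making things up?" "How accurate is the retriever, really?" "How do I know whether the system as a whole is reliable?"

These questions share one root cause: we lack a systematic evaluation method. This is especially hard early in a project, before any real user data exists to validate the RAG pipeline against.

This article breaks down a practical approach: use DeepEval to generate synthetic data and systematically evaluate your RAG pipeline.

It walks through the whole process step by step: installing dependencies, generating data, controlling complexity, and the evaluation logic. By the end you should be able to set up an automated evaluation workflow quickly and understand why synthetic data is a key lever for RAG testing.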

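1. Why evaluate RAG with synthetic data?

In real business scenarios, we want a RAG system to have three core capabilities:

  1. Accurate retrieval (Retriever): it finds the documents most relevant to the question;
  2. Reliable generation (LLM): answers must be traceable to a source, not invented;
  3. Appropriate context (Context): input length and information density should be well balanced.

Before the system goes live, however, we rarely have enough real questions and feedback samples, which makes it hard to tell whether the model's answers are actually grounded.

Synthetic data fills exactly this gap.

By automatically generating simulated user questions paired with ideal answers (golden pairs), we can build a repeatable test set ahead of time:

  • It does not depend on real users;
  • It can systematically cover different types of questions;
  • It can repeatedly verify the effect of retriever and generator optimizations.

DeepEval is the core tool for this process.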

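2. DeepEval: an open-source framework designed for LLM evaluation

DeepEval is an open-source framework dedicated to evaluating large language models, and it supports a range of scenarios including RAG pipelines. Its strengths fall into three areas:

  • Automatic synthetic test data generation: the built-in Synthesizer class generates realistic QA pairs from documents;
  • Multi-dimensional metrics: from Grounding (whether the answer has a source) and Context Relevance (how relevant the context is) to Faithfulness (factual consistency);
  • Extensible configuration: EvolutionConfig controls the complexity and type of the generated samples.

Next, let's get hands-on.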

3. Install dependencies and set up the environment

First, install the required libraries.

pip install deepeval chromadb tiktoken pandas

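After installation, configure your OpenAI API key. DeepEval calls an external model (such as GPT-4) to generate and evaluate data.

Go to the OpenAI API management page, create a new API key, and set it as an environment variable: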

export OPENAI_API_KEY="sk-xxxxxxx"
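
Before running anything that calls the API, it can help to confirm that the key is actually visible to Python. A minimal sketch; the helper name `require_api_key` is hypothetical and is not part of DeepEval or the OpenAI SDK:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment variable `name`, or raise."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation")
    return key
```

Calling `require_api_key()` at the top of an evaluation script turns a missing key into an immediate, readable error instead of a failure deep inside a generation call.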

Tip: first-time OpenAI API users may need to add a payment method and top up about $5 before the API is enabled.

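4. Prepare the source text: raw material for synthetic QA

Next we need a source text that will serve as the corpus for the synthetic data. It should be as varied in content, semantically clear, and factually accurate as possible.

For example: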

text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
...
"""

Save it to a text file:

with open("example.txt", "w", encoding="utf-8") as f:
    f.write(text)

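Tip: you can swap in your own content, such as a project knowledge base, technical documentation, or an internal FAQ, so the generated evaluation samples are closer to your actual business.

5. Automatically generate synthetic data (synthetic goldens)

DeepEval's core Synthesizer class can read documents directly and generate high-quality QA pairs.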

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4.1-nano")

# Generate synthetic data from the document
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# Print a few of the results
for golden in synthesizer.synthetic_goldens[:3]:
    print(golden, "\n")

Example output:

Input: Evaluate the cognitive abilities of corvids in facial recognition tasks.
Expected Output: Crows can recognize human faces and remember them for years, showing advanced memory and problem-solving.
Context: "Crows are among the smartest birds..."

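As you can see, each sample contains:

  • the user question (input)
  • the ideal answer (expected output)
  • the source passage (context)

These are our golden pairs, which can be used for later model performance validation.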
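
To reuse the generated pairs across runs, it is convenient to store them in a plain file. A minimal sketch using only the standard library; `GoldenRecord`, `save_goldens`, and `load_goldens` are hypothetical names for illustration, not DeepEval classes, and the three fields simply mirror the input / expected output / context listed above:

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class GoldenRecord:
    input: str             # simulated user question
    expected_output: str   # ideal answer
    context: List[str]     # source passages the answer is based on


def save_goldens(goldens: List[GoldenRecord], path: str) -> None:
    """Write the goldens to a JSON file so every run uses the same test set."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(g) for g in goldens], f, ensure_ascii=False, indent=2)


def load_goldens(path: str) -> List[GoldenRecord]:
    """Read goldens back from the JSON file written by save_goldens."""
    with open(path, "r", encoding="utf-8") as f:
        return [GoldenRecord(**item) for item in json.load(f)]
```

Freezing the test set this way keeps later comparisons fair: a metric change then reflects a change in the retriever or prompt, not a freshly generated set of questions.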

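6. Controlling sample complexity: the power of EvolutionConfig

Generating QA pairs alone is not enough. We need to control the complexity and diversity of the generated questions so that the tests resemble what real users ask.

DeepEval provides EvolutionConfig, which adjusts how samples are generated through "evolution" strategies.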

from deepeval.synthesizer.config import EvolutionConfig, Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])

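With this configuration, the generated samples go beyond simple question answering and cover:

  • Reasoning questions (Reasoning)
  • Multi-context questions (MultiContext)
  • Comparative questions (Comparative)
  • Hypothetical scenarios (Hypothetical)
  • Broad exploratory questions (InBreadth)

For example:

Q: Compare the significance of Voyager 1's golden record and the Library of Alexandria in human history. A: Both are symbols that carry human knowledge and civilization; the former travels across the cosmos, while the latter bears witness to civilization's beginnings.

Data like this tests the model's multi-step reasoning and information integration more thoroughly.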
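
One way to read the weights in the configuration above is as sampling probabilities: each evolution step picks one evolution type in proportion to its weight. The sketch below illustrates that reading in plain Python. It is an illustration of weighted sampling under that assumption, not DeepEval's internal implementation, and `sample_evolutions` is a hypothetical helper:

```python
import random
from typing import Dict, List


def sample_evolutions(weights: Dict[str, float], num_evolutions: int, seed: int = 0) -> List[str]:
    """Pick `num_evolutions` evolution types in proportion to their weights."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights should sum to 1.0, got {total}")
    rng = random.Random(seed)  # seeded so the draw is repeatable
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


weights = {
    "REASONING": 1 / 5,
    "MULTICONTEXT": 1 / 5,
    "COMPARATIVE": 1 / 5,
    "HYPOTHETICAL": 1 / 5,
    "IN_BREADTH": 1 / 5,
}
# Three evolution steps for one golden, matching num_evolutions=3 above.
print(sample_evolutions(weights, num_evolutions=3))
```

Raising one weight (and lowering the others so the total stays at 1) shifts the test set toward that question type, which is useful when a particular weakness, such as multi-context questions, needs more coverage.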

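7. Building an iterative evaluation loop: closing the loop on RAG improvement

Once we have high-quality synthetic data, we can move on to the core step: the RAG evaluation loop.

A typical flow looks like this:

  1. Retriever testing: verify the relevance of the retrieved documents;
  2. LLM evaluation: check whether generated answers are based on the context;
  3. Metric computation: for example Grounding, Context Relevance, and Faithfulness;
  4. Feedback and optimization: adjust the retrieval strategy or the prompt;
  5. Re-evaluation: check whether the metrics improve.

This is a complete Iterative RAG Improvement Loop.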
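
Steps 2 and 3 can be delegated to DeepEval's built-in metrics rather than hand-written heuristics. The sketch below is one possible wiring, assuming `FaithfulnessMetric` and `ContextualRelevancyMetric` are reasonable stand-ins for the Faithfulness and Context Relevance metrics named above (that mapping is an assumption; the article itself does not show this code). `to_test_case_kwargs` and `run_deepeval` are hypothetical helper names, and the 0.7 threshold is illustrative:

```python
from typing import Dict, List


def to_test_case_kwargs(golden: Dict[str, str], answer: str, contexts: List[str]) -> Dict[str, object]:
    """Arrange one RAG result into the fields LLMTestCase expects."""
    return {
        "input": golden["input"],
        "actual_output": answer,
        "expected_output": golden.get("expected_output", ""),
        "retrieval_context": contexts,
    }


def run_deepeval(results: List[Dict[str, object]], threshold: float = 0.7):
    """Score RAG results with DeepEval; needs deepeval installed and an API key."""
    # Imported lazily so the helper above works without deepeval installed.
    from deepeval import evaluate
    from deepeval.metrics import ContextualRelevancyMetric, FaithfulnessMetric
    from deepeval.test_case import LLMTestCase

    test_cases = [LLMTestCase(**kwargs) for kwargs in results]
    metrics = [
        FaithfulnessMetric(threshold=threshold),
        ContextualRelevancyMetric(threshold=threshold),
    ]
    return evaluate(test_cases=test_cases, metrics=metrics)
```

Each metric here is judged by an LLM, so a run costs API calls in proportion to the number of goldens times the number of metrics; a small golden set is usually enough while iterating.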

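The key point is:

You do not need to wait for real users to hit the pitfalls. Synthetic data already lets you find the system's weak spots in advance.

As the retriever's recall improves and the LLM's factual consistency strengthens, the risk of taking the system live should drop accordingly.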
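
Retriever recall can be tracked with a simple hit rate: for each golden, check whether the chunk the question was generated from appears among the top-k retrieved chunks. A minimal sketch, assuming each golden records the index of its source chunk (the `source_idx` field and the `hit_rate_at_k` helper are hypothetical, not part of DeepEval) and that the retriever returns `(index, score)` pairs:

```python
from typing import Callable, Dict, List, Tuple

Retrieve = Callable[[str, int], List[Tuple[int, float]]]


def hit_rate_at_k(goldens: List[Dict], retrieve: Retrieve, k: int) -> float:
    """Fraction of goldens whose source chunk is among the top-k results."""
    if not goldens:
        return 0.0
    hits = 0
    for g in goldens:
        retrieved_ids = {idx for idx, _ in retrieve(g["input"], k)}
        if g["source_idx"] in retrieved_ids:
            hits += 1
    return hits / len(goldens)
```

With the retrievers in the full script at the end of this article, the callable could be `lambda q, k: retriever.retrieve(q, top_k=k)`.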

The full working code is at the end of the article.

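8. Practical advice and extensions

If you plan to use DeepEval in a real project, consider the following:

  • Corpus selection: prefer structured or knowledge-dense documents such as product manuals and internal FAQs;
  • Model configuration: use a lightweight model (such as gpt-4.1-nano) while iterating on the evaluation, and switch to a full-size model for formal validation;
  • Result analysis: combine with a vector store such as ChromaDB and track how each metric changes;
  • Automation: embed the evaluation script in your CI/CD pipeline so every retriever or prompt update is verified automatically.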
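
For the CI/CD item, one option is a small gate script that reads the JSON report and fails the build when a metric drops below a threshold. A sketch, assuming the report layout written by the full script at the end of this article (a `history` list whose entries carry an `aggregate` dict); the `check_report` helper and the 0.7 thresholds are illustrative choices:

```python
import json
from typing import Dict, List

THRESHOLDS = {
    "avg_grounding": 0.7,
    "avg_faithfulness": 0.7,
    "avg_context_relevance": 0.7,
}


def check_report(report: Dict, thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return a list of failure messages for the last iteration in the report."""
    history = report.get("history", [])
    if not history:
        return ["report contains no iterations"]
    last = history[-1]["aggregate"]
    failures = []
    for name, minimum in thresholds.items():
        value = last.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name}={value:.3f} is below {minimum}")
    return failures


def main(path: str) -> int:
    """Print failures and return a process exit code (non-zero fails the CI job)."""
    with open(path, "r", encoding="utf-8") as f:
        failures = check_report(json.load(f))
    for msg in failures:
        print("[FAIL]", msg)
    return 1 if failures else 0
```

In a CI step this would be invoked as `sys.exit(main("rag_eval_runs/rag_eval_report.json"))` after the evaluation script has run.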

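Over time, this approach moves your RAG system from "it subjectively feels like it works" to "the metrics show it works".

9. Summary: making RAG evaluation less of a black box

The hard part of RAG evaluation is that system behavior often "looks right" while the reliability behind it is hard to verify. DeepEval makes this quantifiable, reproducible, and continuously improvable.

The value of synthetic data is not in replacing real users but in establishing a controlled test environment ahead of time. With mechanisms such as EvolutionConfig, we can even simulate users asking a wide range of complex questions and probe the limits of the system's reasoning and retrieval.

In one sentence:

Before you have user data, synthetic data is the best evaluation baseline; during continuous optimization, DeepEval is your automated coach.

Appendix: full working code: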

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
rag_iterative_eval_full.py
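Full example: iterative evaluation loop (closing the loop on RAG improvement)
Features:
  - Generate/read the document
  - Generate synthetic goldens (OpenAI, with a rule-based fallback; DeepEval is not called in this script)
  - Build a retriever (OpenAI embeddings or TF-IDF)
  - Call the LLM with the retrieved context to generate answers (OpenAI, or a simple overlap-based fallback reply)
  - Compute grounding / context_relevance / faithfulness metrics
  - Automatically adjust top_k and temperature based on the metrics (closing the loop)
  - Save and print the results of each round
Author: jilolo
Date: 2025-10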
"""

import os
import json
import time
import math
import random
import hashlib
from typing import List, Dict, Any, Tuple
from collections import defaultdict, Counter

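# optional imports
# NOTE: the OpenAI calls below (openai.ChatCompletion / openai.Embedding) use the legacy
# pre-1.0 openai interface. In openai>=1.0 those entry points were removed, so each call
# raises and the except branches fall back (pseudo-random embeddings, overlap-based answers).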
try:
    import openai
except Exception:
    openai = None

try:
    import numpy as np
    from numpy.linalg import norm
    NUMPY_AVAILABLE = True
except Exception:
    NUMPY_AVAILABLE = False

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    SKLEARN_AVAILABLE = True
except Exception:
    SKLEARN_AVAILABLE = False

try:
    from tqdm import tqdm
    TQDM_AVAILABLE = True
except Exception:
    TQDM_AVAILABLE = False

# -------------------------
# CONFIG
# -------------------------
CONFIG = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
    "OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
    "OPENAI_COMPLETION_MODEL": "gpt-4o-mini",  # change to available model
    "DOC_PATH": "example.txt",
    "NUM_GOLDENS": 12,
    "ITERATIONS": 6,
    "INITIAL_TOP_K": 3,
    "MAX_TOP_K": 8,
    "MIN_TOP_K": 1,
    "TEMPERATURE_OPTIONS": [0.0, 0.2, 0.5],
    "SEED": 42,
    "REPORT_FILE": "rag_eval_report.json",
    "SAVE_DIR": "rag_eval_runs",
    "PROMPT_TEMPLATE": (
        "You are a knowledgeable assistant. Use only the provided context snippets to answer the question. "
        "If the information is not present in the context, respond with 'Insufficient information in context.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
    # metric thresholds for increasing/decreasing top_k
    "GROUNDING_GOOD": 0.7,
    "GROUNDING_BAD": 0.45,
    "FAITHFULNESS_GOOD": 0.7,
    "FAITHFULNESS_BAD": 0.45,
    "CONTEXT_RELEVANCE_GOOD": 0.7,
    "CONTEXT_RELEVANCE_BAD": 0.45,
}

random.seed(CONFIG["SEED"])
if openai and CONFIG["OPENAI_API_KEY"]:
    openai.api_key = CONFIG["OPENAI_API_KEY"]

# -------------------------
# Utilities
# -------------------------
def safe_print(*args, **kwargs):
    print(*args, **kwargs)

def ensure_dir(path: str):
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

def sha1_snippet(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:10]

# -------------------------
# Example document (will write if missing)
# -------------------------
SAMPLE_TEXT = """Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
The archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Superconductors can carry electric current with zero resistance -- a phenomenon discovered over a century ago but still unlocking new technologies like quantum computers today.
The Library of Alexandria was once the largest center of learning, but much of its collection was lost in fires and wars.
Voyager 1 probe, launched in 1977, has left the solar system, carrying a golden record with sounds and images of Earth.
The Amazon rainforest produces roughly 20% of the world's oxygen.
Coral reefs support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
MRI scanners use strong magnetic fields and radio waves to generate detailed images of organs without harmful radiation.
Moore's Law observed that the number of transistors on microchips doubles roughly every two years.
The Mariana Trench is the deepest part of Earth's oceans, reaching nearly 11,000 meters below sea level.
Ancient civilizations like the Sumerians and Egyptians invented mathematical systems thousands of years ago.
"""

def ensure_example_doc(path: str):
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE_TEXT)
        safe_print(f"[INFO] Wrote sample doc to {path}")

# -------------------------
# Synthetic golden generation (fallback-first approach)
# -------------------------
def simple_rule_based_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    """
    Very simple fallback: split document into sentences/paragraphs and craft simple Q/A.
    """
    with open(doc_path, "r", encoding="utf-8") as f:
        txt = f.read()
    paras = [p.strip() for p in txt.split("\n") if p.strip()]
    goldens = []
    for p in paras:
        q = f"What is one key fact from the following sentence: '{p[:120]}...'? "
        a = p
        goldens.append({"input": q, "expected_output": a, "context": p})
        if len(goldens) >= num:
            break
    return goldens

def openai_synthesize_goldens(doc_path: str, num: int = 12, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> List[Dict[str, str]]:
    """
    Try to use OpenAI to synthesize question-answer pairs.
    If OpenAI is not configured or API call fails, fall back to rule-based generation.
    """
    if openai is None or not getattr(openai, "api_key", None):
        safe_print("[WARN] OpenAI key not found - using rule-based goldens")
        return simple_rule_based_goldens(doc_path, num)
    with open(doc_path, "r", encoding="utf-8") as f:
        doc = f.read()

    prompt = (
        f"You are a dataset creator. Given the document below, produce {num} question-answer pairs. "
        f"For each pair, provide 'question', 'answer' (concise and grounded in the doc), and 'context' (the snippet). "
        f"Return a JSON array of objects.\n\nDocument:\n{doc}\n\n"
    )

    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You generate QA pairs."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=1500
        )
        text = resp["choices"][0]["message"]["content"]
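        # find the JSON array in the text (tolerate prose or code fences around it)
        start = text.find("[")
        end = text.rfind("]")
        if start >= 0 and end > start:
            json_text = text[start:end + 1]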
            try:
                arr = json.loads(json_text)
                goldens = []
                for item in arr[:num]:
                    q = item.get("question") or item.get("input") or item.get("q") or ""
                    a = item.get("answer") or item.get("expected_output") or ""
                    c = item.get("context") or ""
                    goldens.append({"input": q.strip(), "expected_output": a.strip(), "context": c.strip()})
                safe_print(f"[INFO] OpenAI synthesized {len(goldens)} goldens.")
                return goldens
            except Exception as e:
                safe_print("[WARN] Failed to parse JSON from OpenAI output:", e)
                return simple_rule_based_goldens(doc_path, num)
        else:
            safe_print("[WARN] OpenAI response lacking JSON - using rule-based fallback.")
            return simple_rule_based_goldens(doc_path, num)
    except Exception as e:
        safe_print("[ERROR] OpenAI call failed:", e)
        return simple_rule_based_goldens(doc_path, num)

def generate_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    # Attempt DeepEval if installed (not required here); else OpenAI; else rule-based
    # To keep dependencies light in this script we skip DeepEval auto-call.
    return openai_synthesize_goldens(doc_path, num)

# -------------------------
# Retriever: TF-IDF (fallback) and Embedding based (OpenAI)
# -------------------------
class TFIDFRetriever:
    def __init__(self, docs: List[str]):
        if not SKLEARN_AVAILABLE:
            raise RuntimeError("sklearn not available for TF-IDF retriever.")
        self.docs = docs
        self.vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
        self.doc_matrix = self.vectorizer.fit_transform(self.docs)

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.doc_matrix)[0]
        idx_scores = list(enumerate(sims))
        idx_scores.sort(key=lambda x: x[1], reverse=True)
        return idx_scores[:top_k]

class OpenAIEmbeddingRetriever:
    def __init__(self, docs: List[str], embedding_model: str = CONFIG["OPENAI_EMBEDDING_MODEL"]):
        self.docs = docs
        self.embedding_model = embedding_model
        self.embeddings = []
        # compute embeddings
        self._build()

    def _embed_text(self, text: str):
        if openai is None or not getattr(openai, "api_key", None):
            # fallback: random vector (deterministic via hash)
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()  # fake dim
            else:
                return [random.random() for _ in range(512)]
        try:
            resp = openai.Embedding.create(model=self.embedding_model, input=text)
            return resp["data"][0]["embedding"]
        except Exception as e:
            safe_print("[WARN] OpenAI embedding failed:", e)
            # fallback deterministic pseudo-random
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()
            else:
                return [random.random() for _ in range(512)]

    def _build(self):
        self.embeddings = [self._embed_text(d) for d in self.docs]

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        q_emb = self._embed_text(query)
        # compute cosine similarities
        if NUMPY_AVAILABLE:
            qv = np.array(q_emb, dtype=float)
            sims = []
            for emb in self.embeddings:
                ev = np.array(emb, dtype=float)
                denom = (norm(qv) * norm(ev))
                sim = float(np.dot(qv, ev) / denom) if denom > 0 else 0.0
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]
        else:
            sims = []
            for emb in self.embeddings:
                sim = sum(a*b for a,b in zip(q_emb, emb)) / (len(q_emb) or 1)
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]

# helper hashing for fallback embeddings
def hashlib_sha1_int(s: str) -> int:
    return int(hashlib.sha1(s.encode('utf-8')).hexdigest()[:16], 16)

# -------------------------
# Generator (LLM call) with fallback
# -------------------------
def call_openai_chat(question: str, contexts: List[str], temperature: float = 0.0, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> str:
    if openai is None or not getattr(openai, "api_key", None):
        # fallback: naive rule - if any context contains a sentence with overlap words, return that sentence; else "Insufficient"
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return "Insufficient information in context."
    # try call
    prompt = CONFIG["PROMPT_TEMPLATE"].format(context="\n\n".join(contexts), question=question)
    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a precise assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=512,
        )
        text = resp["choices"][0]["message"]["content"].strip()
        return text
    except Exception as e:
        safe_print("[WARN] OpenAI ChatCompletion failed:", e)
        # fallback naive
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return "Insufficient information in context."

# -------------------------
# Metrics implementations
# -------------------------
def compute_context_relevance(retrieved_idxs_scores: List[Tuple[int, float]]) -> float:
    """
    Simple metric: average similarity score (score between 0-1)
    """
    if not retrieved_idxs_scores:
        return 0.0
    scores = [s for _, s in retrieved_idxs_scores]
    # ensure in [0,1]
    clipped = [max(0.0, min(1.0, float(x))) for x in scores]
    return sum(clipped) / len(clipped)

def compute_grounding(answer: str, contexts: List[str]) -> float:
    """
    Heuristic: fraction of answer tokens that have overlap with context tokens.
    Returns 0-1.
    """
    a_words = [w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w) > 2]
    if not a_words:
        return 0.0
    context_text = " ".join(contexts).lower()
    hits = sum(1 for w in a_words if w in context_text)
    return hits / len(a_words)

def compute_faithfulness(answer: str, expected: str) -> float:
    """
    Very simple normalized similarity:
    - overlap ratio of important tokens (set intersection over union)
    """
    a_set = set([w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w)>2])
    e_set = set([w.strip(" ,.;:()[]'\"").lower() for w in expected.split() if len(w)>2])
    if not a_set and not e_set:
        return 1.0
    if not a_set or not e_set:
        return 0.0
    inter = a_set & e_set
    union = a_set | e_set
    return len(inter) / len(union)

# -------------------------
# Single-run RAG evaluation on list of goldens
# -------------------------
def run_rag_eval(
    goldens: List[Dict[str, str]],
    docs: List[str],
    retriever,
    top_k: int,
    temperature: float
) -> Dict[str, Any]:
    """
    Run through goldens, for each:
      - retrieve top_k contexts
      - call generator
      - compute metrics
    Return aggregated metrics and per-sample results
    """
    per_samples = []
    total_grounding = 0.0
    total_context_rel = 0.0
    total_faith = 0.0

    iterator = goldens if not TQDM_AVAILABLE else tqdm(goldens, desc=f"Eval top_k={top_k}, temp={temperature}")

    for g in iterator:
        q = g["input"]
        expected = g.get("expected_output", "")
        # retrieve
        retrieved = retriever.retrieve(q, top_k=top_k)
        contexts = [docs[idx] for idx, _ in retrieved]
        ctx_scores = [score for _, score in retrieved]

        # call generator
        answer = call_openai_chat(q, contexts, temperature=temperature)

        # compute metrics
        context_rel = compute_context_relevance(retrieved)
        grounding = compute_grounding(answer, contexts)
        faith = compute_faithfulness(answer, expected)

        total_context_rel += context_rel
        total_grounding += grounding
        total_faith += faith

        per_samples.append({
            "question": q,
            "expected": expected,
            "answer": answer,
            "retrieved": [{"idx": idx, "score": float(score), "snippet_hash": sha1_snippet(docs[idx])} for idx, score in retrieved],
            "metrics": {"context_relevance": context_rel, "grounding": grounding, "faithfulness": faith}
        })

    n = len(goldens)
    agg = {
        "avg_context_relevance": total_context_rel / n if n else 0.0,
        "avg_grounding": total_grounding / n if n else 0.0,
        "avg_faithfulness": total_faith / n if n else 0.0
    }
    return {"aggregate": agg, "samples": per_samples}

# -------------------------
# Iterative parameter adjustment logic
# -------------------------
def adjust_params(current_top_k: int, metrics: Dict[str, float]) -> int:
    """
    Very simple policy:
      - If grounding is low -> increase top_k by 2 (more context)
      - Else if context relevance is low and grounding is not yet good -> increase top_k by 1
      - Else if grounding and context relevance are both high -> reduce top_k by 1 to save cost
      - If faithfulness is low -> increase top_k by 1 on top of the above
    Bound by min/max.
    """
    g = metrics.get("avg_grounding", 0.0)
    cr = metrics.get("avg_context_relevance", 0.0)
    fa = metrics.get("avg_faithfulness", 0.0)
    new_top_k = current_top_k

    # if grounding is very low, expand context
    if g < CONFIG["GROUNDING_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 2)
    elif cr < CONFIG["CONTEXT_RELEVANCE_BAD"] and g < CONFIG["GROUNDING_GOOD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 1)
    elif g > CONFIG["GROUNDING_GOOD"] and cr > CONFIG["CONTEXT_RELEVANCE_GOOD"]:
        # try shrink to save cost
        new_top_k = max(CONFIG["MIN_TOP_K"], current_top_k - 1)
    # small adjustments if faithfulness very low
    if fa < CONFIG["FAITHFULNESS_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], new_top_k + 1)
    # ensure bounds
    new_top_k = max(CONFIG["MIN_TOP_K"], min(CONFIG["MAX_TOP_K"], new_top_k))
    return new_top_k

def pick_temperature(candidate_list: List[float], metrics: Dict[str, float]) -> float:
    """
    Simple heuristic: if faithfulness low, use lower temp (more deterministic).
    If faithfulness high and grounding high, allow slightly higher temp for diversity.
    """
    fa = metrics.get("avg_faithfulness", 0.0)
    g = metrics.get("avg_grounding", 0.0)
    if fa < 0.4 or g < 0.4:
        return min(candidate_list)
    if fa > 0.75 and g > 0.7:
        return max(candidate_list)
    return candidate_list[len(candidate_list)//2]

# -------------------------
# Main pipeline
# -------------------------
def main():
    safe_print("=== RAG Iterative Evaluation Demo ===")
    ensure_example_doc(CONFIG["DOC_PATH"])
    ensure_dir(CONFIG["SAVE_DIR"])

    # load docs and split into chunks (naive paragraph chunking)
    with open(CONFIG["DOC_PATH"], "r", encoding="utf-8") as f:
        doc_text = f.read()
    paragraphs = [p.strip() for p in doc_text.split("\n") if p.strip()]
    # if paragraphs too short, split sentences
    if len(paragraphs) < 5:
        # attempt sentence split
        sents = [s.strip() for s in doc_text.replace("\n", " ").split(".") if s.strip()]
        # group per 1-2 sentences
        paragraphs = []
        i = 0
        while i < len(sents):
            chunk = sents[i]
            if i+1 < len(sents):
                if random.random() < 0.5:
                    chunk = chunk + ". " + sents[i+1]
                    i += 2
                else:
                    i += 1
            else:
                i += 1
            paragraphs.append(chunk + ".")
    docs = paragraphs

    safe_print(f"[INFO] Loaded {len(docs)} document chunks for retrieval.")

    # generate goldens
    goldens = generate_goldens(CONFIG["DOC_PATH"], CONFIG["NUM_GOLDENS"])
    safe_print(f"[INFO] Generated {len(goldens)} goldens for evaluation.")

    # choose retriever: prefer OpenAI embeddings if available, else TF-IDF
    retriever = None
    use_embedding = False
    if openai and getattr(openai, "api_key", None) and NUMPY_AVAILABLE:
        try:
            retriever = OpenAIEmbeddingRetriever(docs)
            use_embedding = True
            safe_print("[INFO] Using OpenAI embedding retriever.")
        except Exception as e:
            safe_print("[WARN] OpenAIEmbeddingRetriever failed, falling back to TF-IDF:", e)
    if retriever is None:
        if SKLEARN_AVAILABLE:
            retriever = TFIDFRetriever(docs)
            safe_print("[INFO] Using TF-IDF retriever.")
        else:
            # fallback: naive substring search retriever
            class NaiveRetriever:
                def __init__(self, docs):
                    self.docs = docs
                def retrieve(self, query, top_k=3):
                    qs = query.lower()
                    scores = []
                    for i, d in enumerate(self.docs):
                        s = sum(1 for w in set(qs.split()) if w in d.lower())
                        scores.append((i, float(s)))
                    scores.sort(key=lambda x: x[1], reverse=True)
                    return scores[:top_k]
            retriever = NaiveRetriever(docs)
            safe_print("[INFO] Using naive substring retriever.")

    # iterative loop
    cur_top_k = CONFIG["INITIAL_TOP_K"]
    cur_temp = CONFIG["TEMPERATURE_OPTIONS"][0]
    history = []
    for itr in range(1, CONFIG["ITERATIONS"] + 1):
        safe_print(f"\n--- Iteration {itr} | top_k={cur_top_k} | temp={cur_temp} ---")
        result = run_rag_eval(goldens, docs, retriever, top_k=cur_top_k, temperature=cur_temp)
        agg = result["aggregate"]
        safe_print(f"[RESULT] avg_context_relevance={agg['avg_context_relevance']:.3f}, avg_grounding={agg['avg_grounding']:.3f}, avg_faithfulness={agg['avg_faithfulness']:.3f}")
        # save per-iteration
        run_record = {
            "iteration": itr,
            "top_k": cur_top_k,
            "temperature": cur_temp,
            "aggregate": agg,
            "timestamp": time.time(),
            "samples_count": len(result["samples"])
        }
        history.append(run_record)
        # adapt params
        new_top_k = adjust_params(cur_top_k, agg)
        new_temp = pick_temperature(CONFIG["TEMPERATURE_OPTIONS"], agg)
        safe_print(f"[ADAPT] next_top_k={new_top_k}, next_temp={new_temp}")
        # if no change and already good metrics, we can stop early
        if new_top_k == cur_top_k and new_temp == cur_temp and agg["avg_grounding"] > 0.8 and agg["avg_faithfulness"] > 0.8:
            safe_print("[INFO] Metrics are good and stable - stopping early.")
            break
        cur_top_k = new_top_k
        cur_temp = new_temp

    # produce final report (leave the API key out of the saved config)
    report = {
        "config": {k: v for k, v in CONFIG.items() if k != "OPENAI_API_KEY"},
        "docs_count": len(docs),
        "goldens_count": len(goldens),
        "history": history
    }
    report_path = os.path.join(CONFIG["SAVE_DIR"], CONFIG["REPORT_FILE"])
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
    safe_print(f"\n[FINISH] Saved report to {report_path}")
    safe_print("=== End ===")

if __name__ == "__main__":
    main()


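This article is reposted from Halo咯咯. Author: 基咯咯

Copyright belongs to the author. If you repost, please credit the source; otherwise legal liability will be pursued.
Last modified 2025-10-17 08:38:05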