A Summary of 13 Top RAG Techniques

Author: 云朵君 2025-09-29 01:10:00

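Improving the retrieval and generation capabilities of a RAG system is essential for building better AI applications. The techniques discussed in this article range from low-effort, high-impact methods (query rewriting, reranking) to more involved processes (embedding and LLM fine-tuning). The best technique depends on the specific needs and constraints of your application.

Can AI generate truly relevant answers at scale? How do we make sure it understands complex, multi-turn conversations? How do we keep it from confidently producing wrong facts? These are the challenges facing modern AI systems, especially those built with RAG (retrieval-augmented generation). RAG combines the power of document retrieval with the fluency of language generation, so that a system can answer questions with context-aware, fact-grounded responses. Basic RAG systems do well on simple tasks, but they tend to struggle with complex queries, hallucinations, and context memory over long interactions. This is where advanced RAG techniques come in.

In this post, we will look at how to upgrade your RAG pipeline by strengthening every stage of the stack: indexing, retrieval, and generation. We will walk through several powerful methods (with working code) that can help you improve relevance, reduce noise, and scale system performance, whether you are building a medical assistant, an educational tutor, or an enterprise knowledge bot.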

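What are the shortcomings of basic RAG?

Let's look at the basic RAG framework:

[Image]

This RAG architecture shows the basic way chunk embeddings are stored in a vector store. The first step is to load the documents, then split them into chunks using one of various chunking techniques, and finally embed the chunks with an embedding model so that they can be stored in the vector store and searched by similarity.

This diagram depicts the retrieval and generation steps of RAG: the user asks a question, and the system searches the vector store to pull out results for that question. The retrieved content is passed to the LLM together with the question, and the LLM produces a structured output.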

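Basic RAG systems have clear limitations, especially in demanding situations.

Hallucination: Hallucination is a major problem. The model produces content that is factually wrong or not supported by the source documents. This undermines reliability, especially in fields such as medicine or law where precision is critical.
Lack of domain specificity: Standard RAG models struggle with specialized topics. Unless the retrieval and generation processes are tuned to the specifics of the domain, the system may surface irrelevant or inaccurate information.
Complex conversations: Basic RAG systems have difficulty with complex queries and multi-turn conversations. They often lose context during the interaction, which leads to disjointed or incomplete answers. RAG systems must be able to handle increasingly complex queries.

So we will go through advanced RAG techniques for each part of the RAG stack: indexing, retrieval, and generation. We will discuss how to make improvements with open-source libraries and resources. Whether you are building a medical chatbot, an educational bot, or some other application, these advanced techniques are broadly applicable and will improve most RAG systems.

Let's get started with advanced RAG techniques!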

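Indexing and chunking: building a solid foundation

Good indexing is essential for any RAG system. This first step covers how data is imported, split, and stored. Let's explore ways to index data, focusing on how to index and chunk text and how to use metadata.

1. HNSW

Hierarchical Navigable Small Worlds (HNSW) is an efficient algorithm for finding similar items in large datasets. It uses a structured, graph-based approach to quickly locate approximate nearest neighbors (ANN).

Proximity graph: HNSW builds a graph in which each point is connected to nearby points. This structure enables efficient search.
Hierarchy: The algorithm organizes points into multiple layers. The top layers connect points that are far apart, while the lower layers connect points that are close together. This arrangement speeds up the search.
Greedy routing: HNSW uses a greedy algorithm to find neighbors. It starts from a point in the top layer and keeps moving to the closest neighbor until it reaches a local minimum. This reduces the time needed to find similar items.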

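How does HNSW work?

HNSW involves several key parts:

Input layer: Each data point is represented as a vector in a high-dimensional space.
Graph construction:

Nodes are added to the graph one at a time.
Each node is assigned to a layer by a probability function, which determines how likely the node is to be placed in a higher layer.
The algorithm balances the number of connections against search speed.

Search process:

The search starts from a specific entry point in the top layer.
At each step, the algorithm moves to the nearest neighbor.
Once it reaches a local minimum, it drops to the next lower layer and continues searching until it finds the closest point in the bottom layer.

Parameters:

M: the number of neighbors connected to each node.
efConstruction: affects how many neighbors the algorithm considers while building the graph.
efSearch: affects the search process by determining how many neighbors are evaluated.

HNSW is designed to find similar items quickly and with high accuracy, which makes it a good fit for tasks that require efficient search over large datasets.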
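To make the layer assignment and greedy routing described above more concrete, here is a minimal pure-Python sketch. It is not how FAISS implements HNSW: the function names `random_level` and `greedy_search` and the toy one-dimensional graph are assumptions made for illustration, and only a single layer is searched.

```python
import math
import random

def random_level(m: int, rng: random.Random) -> int:
    """Draw a node's top layer; higher layers are exponentially less likely."""
    m_l = 1.0 / math.log(m)  # level normalization factor, assumes m > 1
    return int(-math.log(1.0 - rng.random()) * m_l)

def greedy_search(vectors, neighbors, entry, query):
    """Walk one layer: hop to the neighbor closest to the query until
    no neighbor is closer than the current node (a local minimum)."""
    current = entry
    best = math.dist(vectors[current], query)
    while True:
        nxt = min(neighbors[current], key=lambda n: math.dist(vectors[n], query))
        d = math.dist(vectors[nxt], query)
        if d >= best:
            return current, best
        current, best = nxt, d

# Toy single-layer graph (hypothetical data): five points on a line,
# each linked to its immediate neighbors.
vectors = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,), 4: (4.0,)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

node, dist = greedy_search(vectors, neighbors, entry=0, query=(3.2,))
print(node, round(dist, 2))
print([random_level(32, random.Random(seed)) for seed in range(10)])
```

In a full HNSW search, the node returned on one layer becomes the entry point for the layer below, and the bottom layer keeps a candidate list whose size is controlled by efSearch rather than a single current node.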

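[Image]

The figure depicts a simplified HNSW search: starting from the "entry point" (blue), the algorithm navigates the graph toward the "query vector" (yellow). The "nearest neighbor" (striped) is identified by traversing edges based on proximity. This illustrates the core idea of graph navigation for efficient approximate nearest-neighbor search.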

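Hands-on with HNSW

Follow the steps below to implement the Hierarchical Navigable Small Worlds (HNSW) algorithm with FAISS. This guide includes sample output to illustrate the process.

Step 1: Set the HNSW parameters

First, define the parameters of the HNSW index. You need to specify the vector size and the number of neighbors per node.

import faiss
import numpy as np
# Set up HNSW parameters
d = 128  # Size of the vectors
M = 32   # Number of neighbors for each node

Step 2: Initialize the HNSW index

Create the HNSW index with the parameters defined above.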

# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)

Step 3: Set efConstruction

Before adding data to the index, set the efConstruction parameter. It controls how many neighbors the algorithm considers while building the index.

efConstruction = 200  # Example value for efConstruction
index.hnsw.efConstruction = efConstruction

Step 4: Generate sample data

In this example, we generate random data to index. Here, "xb" is the dataset to be indexed.

# Generate random dataset of vectors
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
# Add data to the index
index.add(xb)  # Build the index

Step 5: Set efSearch

After building the index, set the efSearch parameter. It affects the search process.

efSearch = 100  # Example value for efSearch
index.hnsw.efSearch = efSearch

Step 6: Run a search

Now you can search for the nearest neighbors of the query vectors. Here, "xq" holds the query vectors.

# Generate random query vectors
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
# Perform a search for the top k nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)
# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)

Output (illustrative format only; the actual values depend on the random data)

Query Vectors:

 [[0.12345678 0.23456789 ... 0.98765432]
 [0.23456789 0.34567890 ... 0.87654321]
 [0.34567890 0.45678901 ... 0.76543210]
 [0.45678901 0.56789012 ... 0.65432109]
 [0.56789012 0.67890123 ... 0.54321098]]

Nearest Neighbors Indices:

 [[123 456 789 101 112]
 [234 567 890 123 134]
 [345 678 901 234 245]
 [456 789  12 345 356]
 [567 890 123 456 467]]

Nearest Neighbors Distances:

 [[0.123 0.234 0.345 0.456 0.567]
 [0.234 0.345 0.456 0.567 0.678]
 [0.345 0.456 0.567 0.678 0.789]
 [0.456 0.567 0.678 0.789 0.890]
 [0.567 0.678 0.789 0.890 0.901]]

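2. Semantic chunking

This approach splits text by meaning rather than by a fixed size, so each chunk represents one coherent piece of information. We compute the cosine distance between sentence embeddings. If two sentences are semantically similar (their distance is below a threshold), they go into the same chunk. This yields chunks of varying length that follow the meaning of the content.

Pros: Produces more coherent, meaningful chunks, which improves retrieval.
Cons: Requires more computation (it uses a BERT-based encoder).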
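The thresholding idea can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the `embed` function is a toy bag-of-words stand-in for a real sentence encoder, and the 0.7 threshold and the sample sentences are arbitrary.

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def semantic_chunks(sentences, threshold=0.7):
    """Start a new chunk whenever adjacent sentences are at least `threshold` apart."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(cur)) < threshold:
            chunks[-1].append(cur)   # similar enough: same chunk
        else:
            chunks.append([cur])     # topic shift: open a new chunk
    return [" ".join(c) for c in chunks]

sentences = [
    "HNSW builds a graph index.",
    "The graph index speeds up search.",
    "Metadata filters narrow results.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With a real encoder the structure stays the same; only `embed` changes, and the threshold usually has to be tuned on your own documents.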

Hands-on semantic chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])
print(docs[0].page_content)

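This code uses LangChain's SemanticChunker, which splits a document into semantically related chunks using OpenAI embeddings. The chunks it creates are meant to capture coherent semantic units rather than arbitrary fragments of text.

3. Language-model-based chunking

This advanced approach uses a language model to produce complete statements from the text, so each chunk is semantically complete. A language model (for example, one with 7 billion parameters) processes the text and breaks it down into statements that are each meaningful on their own. The model then combines these statements into chunks, balancing completeness against context. The method is computationally expensive but highly accurate.

Pros: Adapts to the nuances of the text and creates high-quality chunks.
Cons: Computationally expensive; may require fine-tuning for specific uses.

Hands-on language-model-based chunking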

import asyncio
from openai import AsyncOpenAI

# Assumption: the original snippet does not show how `client` is created;
# an async OpenAI client is needed because the call below is awaited.
client = AsyncOpenAI()

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        # Prepend the generated context to the original chunk
        return f"{context} {chunk}"
    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

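This snippet uses an LLM (here OpenAI's gpt-4o, called through client.chat.completions.create) to generate context for each chunk of a document. It processes the chunks asynchronously, prompting the LLM to explain how each chunk relates to the full document. Finally, it returns a list of the original chunks, each prefixed with its generated context, which enriches the chunks and improves search retrieval.

4. Using metadata: adding context

Adding and filtering metadata

Metadata provides extra context that improves retrieval accuracy. By adding metadata such as dates, patient age, and medical history, you can filter out irrelevant information at search time. Filtering narrows the search, which makes retrieval more efficient and more relevant. When indexing, store the metadata alongside the text.

For example, healthcare data includes the age, visit date, and specific conditions in patient records. Use this metadata to filter search results so that the system retrieves only relevant information. For instance, if a query is about children, filter out records of patients over 18. This reduces noise and improves relevance.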
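As a rough illustration of the pediatric example, the sketch below filters records on their metadata before any similarity search would run. The record layout, the field names `age` and `visit_date`, and the helper `filter_records` are all hypothetical; a real vector database would express the same constraint through its own filter syntax.

```python
# Hypothetical patient records: text plus the metadata stored alongside it.
records = [
    {"text": "Asthma follow-up visit.", "age": 9, "visit_date": "2024-03-02"},
    {"text": "Hypertension check.", "age": 54, "visit_date": "2024-03-05"},
    {"text": "Routine vaccination.", "age": 4, "visit_date": "2023-11-20"},
]

def filter_records(records, max_age=None, since=None):
    """Drop records that fail the metadata constraints."""
    kept = []
    for r in records:
        if max_age is not None and r["age"] > max_age:
            continue
        # ISO-formatted dates compare correctly as plain strings
        if since is not None and r["visit_date"] < since:
            continue
        kept.append(r)
    return kept

pediatric = filter_records(records, max_age=18)
print([r["text"] for r in pediatric])
```

Only the surviving records need to be scored against the query, which is where the noise reduction comes from.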

Example

Chunk #1

Source Metadata:  {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

2.2.1 The First Incompleteness Theorem
In his Logical Journey (Wang 1996) Hao Wang published the
full text of material Gödel had written (at Wang’s request)
about his discovery of the incompleteness theorems. This material had
formed the basis of Wang’s “Some Facts about Kurt
Gödel,” and was read and approved by Gödel:

Chunk #2

Source Metadata:  {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

The First Incompleteness Theorem provides a counterexample to
completeness by exhibiting an arithmetic statement which is neither
provable nor refutable in Peano arithmetic, though true in the
standard model. The Second Incompleteness Theorem shows that the
consistency of arithmetic cannot be proved in arithmetic itself. Thus
Gödel’s theorems demonstrated the infeasibility of the
Hilbert program, if it is to be characterized by those particular
desiderata, consistency and completeness.

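Here we can see that the metadata contains the chunk's unique ID and its source, which gives the chunk more context and makes it easier to retrieve.

5. Generating metadata with GLiNER

You will not always have rich metadata, but a model like GLiNER can generate it on the fly! GLiNER tags and labels chunks during ingestion to create metadata.

Implementation

Give GLiNER a set of labels to look for in each chunk. When it finds a label, it tags the chunk with it; when nothing matches, no tag is produced. This usually works well, but niche datasets may require fine-tuning. It improves retrieval accuracy at the cost of an extra processing step. GLiNER can also parse incoming queries and match them against the metadata tags for filtering.

A demo is available for GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer.

These techniques build robust RAG systems that can retrieve data efficiently from large datasets. How you use chunking and metadata depends on the specific needs and characteristics of your dataset.

Retrieval: finding the right information

Now let's focus on the "R" in RAG. How can we improve retrieval from a vector database? The goal is to retrieve all the documents relevant to the query, which greatly increases the odds that the LLM produces a high-quality result. Here are some techniques: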

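6. Hybrid search

Hybrid search combines vector search (which finds semantic meaning) with keyword search (which finds exact matches), getting the strengths of both. In the AI field many terms are specific keywords: algorithm names, technical terms, names of large language models (LLMs). Vector search alone may miss these terms, while keyword search makes sure they are taken into account. Combining the two creates a more complete retrieval process. The two searches run at the same time.

The results are merged and ranked with a weighting scheme. With Weaviate, for example, you can adjust the alpha parameter to balance vector and keyword results. This produces a single merged, ranked list.

Pros: Balances precision and recall, improving retrieval quality.
Cons: The weights need careful tuning.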
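The weighting step can be sketched as follows. This is a simplified illustration, not Weaviate's actual fusion code: the min-max normalization, the helper names, and the sample scores are assumptions. The sketch follows the convention that alpha = 1 means pure vector search and alpha = 0 means pure keyword search.

```python
def normalize(scores: dict) -> dict:
    """Min-max scale scores to [0, 1] so the two result lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 1.0 for k, v in scores.items()}

def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Fuse both result lists: alpha weights vector scores, (1 - alpha) keyword scores."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    ids = sorted(set(v) | set(k))  # sorted so ties break deterministically
    fused = {i: alpha * v.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0) for i in ids}
    return sorted(ids, key=fused.get, reverse=True)

# Hypothetical scores from the two searches (different scales on purpose).
vector_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
keyword_scores = {"doc_b": 9.3, "doc_d": 7.1, "doc_a": 2.0}

print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5))
print(hybrid_rank(vector_scores, keyword_scores, alpha=1.0)[0])
```

A document that scores well in both lists, like doc_b here, rises to the top at alpha = 0.5 even though it is not the best vector match.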

Hands-on hybrid search

from langchain_community.retrievers import WeaviateHybridSearchRetriever
from langchain_core.documents import Document
# `client` is assumed to be an already-connected Weaviate client
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    attributes=[],
    create_schema_if_missing=True,
)
retriever.invoke("the ethical implications of AI")

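This code initializes a WeaviateHybridSearchRetriever for retrieving documents from a Weaviate vector database. It relies on Weaviate's hybrid retrieval, which combines vector search with keyword search. Finally, it runs the query "the ethical implications of AI" and retrieves relevant documents with this hybrid approach.

7. Query rewriting

The queries people write are not always the best input for a database or a language model. Rewriting them with a language model can noticeably improve retrieval.

Query rewriting for vector databases: This turns the user's initial query into a database-friendly form. For example, "What are AI agents and why are they the next big thing in 2025" can be rewritten as "AI agents big thing in 2025". Any LLM can do the rewriting, as long as it captures the important aspects of the query.
Prompt rewriting for language models: This means automatically creating prompts that optimize the interaction with the language model, which can improve the quality and accuracy of the results. We can use a framework such as DSPy, or any LLM, to do the rewriting. The rewritten queries and prompts ensure that the search retrieves relevant documents and that the language model is prompted effectively.

Multi-query retrieval

Small changes in how a query is worded can lead to different retrieval results. The problem gets worse when the embeddings do not accurately capture the meaning of the data. Prompt engineering or tuning is the usual remedy, but it can be very time-consuming.

MultiQueryRetriever simplifies this task. It uses a large language model (LLM) to generate several queries from different perspectives based on a single user input. For each generated query it retrieves a set of relevant documents. By taking the unique union of the results across all queries, MultiQueryRetriever yields a broader set of potentially relevant documents. This raises the chance of finding useful information without extensive manual tuning.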

from langchain_openai import ChatOpenAI
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
from langchain.retrievers.multi_query import MultiQueryRetriever
# Set logging for the queries
import logging
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity",
                                               search_kwargs={"k": 2})
mq_retriever = MultiQueryRetriever.from_llm(
   retriever=similarity_retriever3, llm=chatgpt,
   include_original=True
)
logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs

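This code builds a multi-query retrieval system with LangChain. It generates several variants of the input query ("what is the capital of India?"). Each variant is then run against the Chroma vector database (chroma_db3) through the similarity retriever, which widens the search and captures a wider range of relevant documents. MultiQueryRetriever finally aggregates the retrieved documents and returns them.

Output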

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
 page_content='New Delhi () is the capital of India and a union territory of
 the megacity of Delhi. It has a very old history and is home to several
 monuments where the city is expensive to live in. In traditional Indian
 geography it falls under the North Indian zone. The city has an area of
 about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),

 Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
 page_content="Kolkata (spelled Calcutta before 1 January 2001) is the
 capital city of the Indian state of West Bengal. It is the second largest
 city in India after Mumbai. It is on the east bank of the River Hooghly.
 When it is called Calcutta, it includes the suburbs. This makes it the third
 largest city of India. This also makes it the world's 8th largest
 metropolitan area as defined by the United Nations. Kolkata served as the
 capital of India during the British Raj until 1911. Kolkata was once the
 center of industry and education. However, it has witnessed political
 violence and economic problems since 1954. Since 2000, Kolkata has grown due
 to economic growth. Like other metropolitan cities in India, Kolkata
 struggles with poverty, pollution and traffic congestion."),

 Document(metadata={'article_id': '22215', 'title': 'States and union
 territories of India'}, page_content='The Republic of India is divided into
 twenty-eight States,and eight union territories including the National
 Capital Territory.')]

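8. LLM prompt-based contextual compression retrieval

Contextual compression helps improve the relevance of retrieved documents. It works in two main ways:

Extracting relevant content: Remove the parts of a retrieved document that have nothing to do with the query, keeping only the parts that answer the question.
Filtering irrelevant documents: Exclude documents that are unrelated to the query, without changing the content of the documents themselves.

To do this, we can use LLMChainExtractor, which reviews the initially returned documents and extracts only the content relevant to the query. It may also drop documents that are entirely irrelevant.

Here is how to do it with LangChain: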

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output:

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
 page_content='New Delhi is the capital of India and a union territory of the
 megacity of Delhi.')]

For a different query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

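Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
 page_content='Kolkata served as the capital of India during the British Raj
 until 1911.')]

LLMChainFilter offers a simpler but still effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard, without changing their content.

Here is how to implement the filter: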

from langchain.retrievers.document_compressors import LLMChainFilter
# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output:

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
 page_content='New Delhi is the capital of India and a union territory of the
 megacity of Delhi.')]

For another query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

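Output:

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
 page_content='Kolkata served as the capital of India during the British Raj
 until 1911.')]

These strategies refine retrieval by focusing on relevant content. LLMChainExtractor extracts only the necessary parts of each document, while LLMChainFilter decides which documents to keep. Both improve the quality of the retrieved information and make it more relevant to the user's query.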

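9. Fine-tuning embedding models

A pre-trained embedding model is a good starting point. Fine-tuning it on your own data can significantly improve retrieval.

Choosing the right model: For specialized fields such as medicine, pick a model pre-trained on relevant data. For example, you can use the MedCPT family of query and document encoders, which were pre-trained at scale on 255 million query-article pairs from PubMed search logs.

Fine-tuning with positive and negative pairs: Collect your own data and create pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to tell them apart. This helps the model learn domain-specific relationships and so improves retrieval.

Pros: Better retrieval performance.
Cons: Requires carefully constructed training data.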
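To show what "learning from positive and negative pairs" means numerically, here is a small sketch of a triplet margin loss on cosine similarity. The article does not say which training objective to use, so the loss choice, the margin of 0.2, and the two-dimensional toy vectors are all assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is at least `margin` more similar
    to the anchor than the negative is; otherwise the shortfall."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy embeddings (hypothetical): a query, a relevant passage, an unrelated one.
query = [1.0, 0.0]
relevant = [0.9, 0.1]
unrelated = [0.0, 1.0]

print(round(triplet_loss(query, relevant, unrelated), 3))  # pairs already well separated
print(round(triplet_loss(query, unrelated, relevant), 3))  # labels swapped: large loss
```

During fine-tuning, gradients of a loss like this pull the embeddings of positive pairs together and push negative pairs apart, which is why the quality of the pairs matters so much.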

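Combined, these techniques build a strong retrieval system. This makes the objects handed to the LLM more relevant, which in turn improves generation quality.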

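Generation: producing high-quality responses

Finally, let's discuss how to improve the generation quality of the language model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible, since irrelevant data can trigger hallucinations. Here are some techniques for better generation:

10. Autocut to remove irrelevant information

Autocut filters out irrelevant information retrieved from the database, which keeps the LLM from being misled.

Retrieve and score similarity: When a query is made, multiple objects are retrieved along with their similarity scores.
Identify and cut: Use the similarity scores to find a cutoff point where the score drops significantly, and exclude the objects beyond it. This ensures that only the most relevant information reaches the LLM. For example, if you retrieve six objects, the score might drop sharply after the fourth. By looking at the rate of change, you can decide which objects to exclude.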

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import chain
# `docs` is assumed to be a list of Document objects prepared earlier
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)
@chain
def retriever(query: str) -> List[Document]:
    # Fetch the matches together with their similarity scores
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        # Keep the score in the metadata so later steps can cut on it
        doc.metadata["score"] = score
    return docs
result = retriever.invoke("dinosaur")
result

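This snippet runs a similarity search with LangChain and Pinecone. It embeds the documents with OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to the given query ("dinosaur"), obtains their similarity scores, adds those scores to the document metadata, and returns the results.

Output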

[Document(page_content='In her second book, Dr. Simmons delves deeper into
 the ethical considerations surrounding AI development and deployment. It is
 an eye-opening examination of the dilemmas faced by developers,
 policymakers, and society at large.', metadata={}),

 Document(page_content='A comprehensive analysis of the evolution of
 artificial intelligence, from its inception to its future prospects. Dr.
 Simmons covers ethical considerations, potentials, and threats posed by
 AI.', metadata={}),

 Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes
 a look at the subtle, unnoticed presence and influence of AI in our everyday
 lives. It reveals how AI has become woven into our routines, often without
 our explicit realization.", metadata={}),

 Document(page_content='Prof. Sterling explores the potential for harmonious
 coexistence between humans and artificial intelligence. The book discusses
 how AI can be integrated into society in a beneficial and non-disruptive
 manner.', metadata={})]

Because the retriever stores a similarity score in each document's metadata, we can truncate the list at a threshold.
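The cut itself is not shown in the snippet above, so here is a minimal sketch of one way to do it: keep everything before the largest drop between consecutive scores. The function name `autocut` and the sample scores are assumptions, and this is a simplification rather than the exact algorithm any particular vector database ships.

```python
def autocut(scores):
    """Return how many leading results to keep.

    `scores` must be sorted from most to least similar. The cut lands
    just before the largest drop between consecutive scores.
    """
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    biggest = max(range(len(drops)), key=drops.__getitem__)
    return biggest + 1

# Hypothetical similarity scores for six retrieved objects.
scores = [0.91, 0.89, 0.88, 0.86, 0.52, 0.50]
keep = autocut(scores)
print(keep, scores[:keep])
```

With the scores stored in `doc.metadata["score"]`, the same function can be applied to the retriever's results to decide how many documents to pass on to the LLM.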

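11. Reranking retrieved objects

Reranking uses a more advanced model to re-evaluate and reorder the objects that were retrieved initially. This improves the quality of the final retrieved set.

Over-fetch: Initially retrieve more objects than you need.
Apply a ranking model: Re-assess relevance with a higher-latency model, usually a cross-encoder. The model considers the query together with each object, pair by pair, to re-evaluate similarity.
Reorder the results: Reorder the objects according to the new assessment, putting the most relevant results at the top. This ensures the most relevant documents come first and improves the data given to the large language model (LLM).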

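from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

def pretty_print_docs(docs):
    # Helper used below but not defined in the original snippet:
    # print the documents separated by a horizontal rule
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i + 1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

llm = ChatOpenAI(temperature=0)

# `retriever` is assumed to be an existing base retriever (e.g. from a vector store)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)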

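This snippet uses FlashrankRerank inside a ContextualCompressionRetriever to improve the relevance of the retrieved documents. It reranks the documents fetched by the base retriever (retriever) according to their relevance to the query "What did the president say about Ketanji Jackson Brown". Finally, it prints the document IDs followed by the compressed, reranked documents.

Output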

[0, 5, 3]

Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

----------------------------------------------------------------------------------------------------

Document 2:

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.

----------------------------------------------------------------------------------------------------

Document 3:

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.

By the end of this year, the deficit will be down to less than half what it was before I took office.  

The only president ever to cut the deficit by more than one trillion dollars in a single year.

Lowering your costs also means demanding more competition.

I’m a capitalist, but capitalism without competition isn’t capitalism.

It’s exploitation—and it drives up prices.

The output shows that the retrieved chunks have been reordered by relevance.
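Stripped of the library, the over-fetch, score, and reorder steps look roughly like this. The `pair_score` function is a toy term-overlap stand-in for a real cross-encoder, and the candidate texts are shortened, lowercased paraphrases made up for the example; both are assumptions.

```python
def pair_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def rerank(query, candidates, top_n=2):
    """Score every (query, candidate) pair and keep the top_n best."""
    scored = sorted(candidates, key=lambda doc: pair_score(query, doc), reverse=True)
    return scored[:top_n]

# Over-fetched candidates: more than the two we finally need.
candidates = [
    "the deficit will be down by the end of this year",
    "i nominated judge ketanji brown jackson to the supreme court",
    "president zelenskyy spoke to the european parliament",
    "the justice department will name a chief prosecutor",
]
top = rerank("who did the president nominate to the supreme court", candidates)
for doc in top:
    print(doc)
```

A real cross-encoder replaces `pair_score` with a model call per pair, which is why reranking is applied only to the small over-fetched set and not to the whole index.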

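12. Fine-tuning the LLM

Fine-tuning an LLM on domain-specific data can significantly improve its performance. Take a model like Meditron 70B: it is a version of LLaMA 2 70b fine-tuned on medical data, using both of the following methods:

Unsupervised fine-tuning: Continue pre-training on a large amount of domain-specific text.
Supervised fine-tuning: Further refine the model with supervised learning on domain-specific tasks, such as medical multiple-choice questions. This specialized training helps the model perform well in the target domain. On certain specific tasks it outperforms its base model as well as larger, less specialized models such as GPT-3.5.

[Image: Fine-tuning]

The figure shows the process of fine-tuning on task-specific examples. This approach lets developers specify the desired outputs, encourage certain behaviors, or gain better control over the model's responses.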

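13. Using RAFT: adapting language models to domain-specific RAG

RAFT (Retrieval-Augmented Fine-Tuning) is a method for improving how large language models (LLMs) work in a specific domain. It helps these models answer questions more accurately using relevant information from documents.

Retrieval-augmented fine-tuning: RAFT combines fine-tuning with retrieval, so that during training the model learns from both useful and less useful documents.
Chain-of-thought reasoning: The model generates answers that show its reasoning process, which helps it give clear and accurate responses grounded in the retrieved documents.
Dynamic document handling: RAFT trains the model to find and use the most relevant documents while ignoring those that do not help answer the question.

RAFT architecture

The RAFT architecture has several key components:

Input layer: The model takes a question (Q) and a set of retrieved documents (D), which includes both relevant and irrelevant documents.
Processing layer:

The model analyzes the input to find the important information in the documents.
It produces an answer (A*) that cites the relevant documents.

Output layer: The model produces the final answer from the relevant documents while ignoring the irrelevant ones.
Training mechanism: During training, some examples contain both relevant and irrelevant documents, while others contain only irrelevant documents. This mix encourages the model to pay attention to which context is actually useful rather than treating every retrieved document as relevant.
Evaluation: The model is evaluated on its ability to answer questions accurately using the retrieved documents.

With this architecture, RAFT strengthens the model's ability to work in a specific domain and offers a reliable way to generate accurate, relevant responses.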
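The training mechanism can be illustrated with a small data-preparation sketch. Everything here is hypothetical: the function name `build_raft_example`, the field names, the number of distractors, and the 0.8 share of examples that include the relevant document are assumptions chosen for the example, not values taken from the article.

```python
import random

def build_raft_example(question, answer, oracle_doc, distractor_pool,
                       num_distractors=3, p_oracle=0.8, rng=random):
    """Assemble one training example: distractors always, the relevant
    (oracle) document only with probability p_oracle."""
    docs = rng.sample(distractor_pool, num_distractors)
    has_oracle = rng.random() < p_oracle
    if has_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)  # do not let position give the answer away
    return {
        "question": question,
        "documents": docs,
        "answer": answer,  # in RAFT, a reasoned answer that cites the relevant document
        "has_oracle": has_oracle,
    }

pool = ["distractor 1", "distractor 2", "distractor 3", "distractor 4", "distractor 5"]
example = build_raft_example(
    "What is the capital of India?",
    "New Delhi, according to the retrieved passage.",
    "New Delhi is the capital of India.",
    pool,
    rng=random.Random(0),
)
print(example["has_oracle"], len(example["documents"]))
```

Each resulting dictionary would then be formatted into a prompt (question plus documents) and a target (the answer) for supervised fine-tuning.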

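[Image: RAFT architecture]

The top-left panel shows the approach of adapting an LLM to read the solution from a set of positive and distractor documents. This contrasts with the standard RAG setup, where models are trained on retriever outputs, a mixture of memorization and reading. At test time, all methods follow the standard RAG setting, with the top-k retrieved documents provided in the context.

Conclusion

Improving the retrieval and generation capabilities of a RAG system is essential for building better AI applications. The techniques discussed in this article range from low-effort, high-impact methods (query rewriting, reranking) to more involved processes (embedding and LLM fine-tuning). The best technique depends on the specific needs and constraints of your application. Applied thoughtfully, advanced RAG techniques help developers build AI systems that are more accurate, more reliable, and more context-aware, and that can handle complex information needs.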

Editor: 武曉燕 Source: 數據STUDIO
