Evaluating the output of Large Language Models (LLMs) is a critical challenge for these notoriously non-deterministic AI applications, especially as they move into production.
Traditional metrics such as ROUGE and BLEU fall short when it comes to assessing the nuanced, context-dependent responses produced by modern LLMs. Human evaluation is accurate, but it is expensive, slow, and impossible to scale.
Enter LLM-as-a-Judge, a powerful technique that uses LLMs themselves to evaluate the quality of AI-generated content. Research shows that sophisticated judge models can reach 85% agreement with human judgments, which is actually higher than the agreement between humans themselves (81%).
In this article, we explore how Spring AI's Recursive Advisors provide an elegant framework for implementing the LLM-as-a-Judge pattern, enabling you to build self-improving AI systems with automated quality control. To learn more about the Recursive Advisor API, see our previous post: Building Self-Improving AI Agents with Spring AI Recursive Advisors.
💡 Demo: Find the complete example implementation in evaluation-recursive-advisor-demo.
LLM-as-a-Judge is an evaluation approach in which a large language model (LLM) assesses the quality of content generated by other models or by itself. Rather than relying solely on human evaluators or traditional automated metrics, LLM-as-a-Judge uses an LLM to score, classify, or compare responses against predefined criteria.
Why does it work? Evaluation is inherently easier than generation. When you use an LLM as a judge, you ask it to perform a simpler, more focused task (assessing specific properties of existing text) rather than the complex task of creating original content while balancing multiple constraints. A good analogy: critiquing is easier than creating, and spotting problems is simpler than preventing them.
There are two main LLM-as-a-Judge evaluation patterns: direct assessment, where the judge scores a single response against a rubric, and pairwise comparison, where the judge picks the better of two candidate responses.
An LLM judge assesses quality dimensions such as relevance, factual accuracy, faithfulness to source material, instruction adherence, and overall coherence and clarity, across domains including healthcare, finance, RAG systems, and conversational applications.
While general-purpose models like GPT-4 and Claude can serve as effective judges, specialized LLM-as-a-Judge models consistently outperform them on evaluation tasks. The Judge Arena leaderboard tracks how well various models perform as judges.
Spring AI's ChatClient offers a fluent API that is a natural fit for implementing the LLM-as-a-Judge pattern. Its Advisors system lets you intercept, modify, and enhance AI interactions in a modular, reusable way.
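For a quick feel of the API, here is a minimal sketch (not taken from the demo; the chatModel variable and the prompt text are placeholders, while SimpleLoggerAdvisor is Spring AI's built-in logging advisor) of building a ChatClient with a default advisor and making a call:
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.SimpleLoggerAdvisor;

// Build the client once, registering advisors that intercept every call and response
ChatClient chatClient = ChatClient.builder(chatModel) // chatModel: any configured ChatModel bean
        .defaultAdvisors(new SimpleLoggerAdvisor())
        .build();

// The fluent call chain: prompt -> call -> extract content
String answer = chatClient.prompt()
        .user("What is the capital of France?")
        .call()
        .content();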
The recently introduced Recursive Advisors extend this further by enabling looping patterns, which are a perfect fit for self-refining evaluation workflows:
public class MyRecursiveAdvisor implements CallAdvisor {
@Override
public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
// Call the chain initially
ChatClientResponse response = chain.nextCall(request);
// Check if we need to retry based on evaluation
while (!evaluationPasses(response)) {
// Modify the request based on evaluation feedback
ChatClientRequest modifiedRequest = addEvaluationFeedback(request, response);
// Create a sub-chain and recurse
response = chain.copy(this).nextCall(modifiedRequest);
}
return response;
}
}
We will implement a SelfRefineEvaluationAdvisor that uses Spring AI's Recursive Advisors to embody the LLM-as-a-Judge pattern. The advisor automatically evaluates AI responses and retries failed attempts with feedback-driven improvements: generate a response → evaluate its quality → retry with feedback if needed → repeat until the quality threshold is met or the retry limit is reached.
Let's examine the implementation, which demonstrates several advanced evaluation patterns.
This implementation demonstrates the direct assessment evaluation pattern, in which the judge model scores individual responses using a pointwise rating system (1-4). It combines this with a self-refine strategy that automatically retries failed evaluations by feeding specific feedback into subsequent attempts, creating an iterative improvement loop.
The advisor embodies two key LLM-as-a-Judge concepts: direct assessment and self-refinement.
(Based on the article: Using LLM-as-a-judge 🧑⚖️ for automated and versatile evaluation)
public final class SelfRefineEvaluationAdvisor implements CallAdvisor {

    private static final Logger logger = LoggerFactory.getLogger(SelfRefineEvaluationAdvisor.class);

    // Supplied via the builder (the builder, constructor, getName()/getOrder(), and the
    // getPromptQuestion()/getAssistantAnswer() helpers can be found in the demo project)
    private final ChatClient chatClient;                   // separate judge ChatClient
    private final PromptTemplate evaluationPromptTemplate; // defaults to the template below
    private final int maxRepeatAttempts;                   // retry budget
    private final int successRating;                       // minimum passing rating (1-4)
private static final PromptTemplate DEFAULT_EVALUATION_PROMPT_TEMPLATE = new PromptTemplate(
"""
You will be given a user_question and assistant_answer couple.
Your task is to provide a 'total rating' scoring how well the assistant_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the assistant_answer is not helpful at all, and 4 means that the assistant_answer completely and helpfully addresses the user_question.
Here is the scale you should use to build your answer:
1: The assistant_answer is terrible: completely irrelevant to the question asked, or very partial
2: The assistant_answer is mostly not helpful: misses some key aspects of the question
3: The assistant_answer is mostly helpful: provides support, but still could be improved
4: The assistant_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
Provide your feedback as follows:
\\{
"rating": 0,
"evaluation": "Explanation of the evaluation result and how to improve if needed.",
"feedback": "Constructive and specific feedback on the assistant_answer."
\\}
Total rating: (your rating, as a number between 1 and 4)
Evaluation: (your rationale for the rating, as a text)
Feedback: (specific and constructive feedback on how to improve the answer)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and answer.
Question: {question}
Answer: {answer}
Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Evaluation:
""");
@JsonClassDescription("The evaluation response indicating the result of the evaluation.")
public record EvaluationResponse(int rating, String evaluation, String feedback) {}
@Override
public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
var request = chatClientRequest;
ChatClientResponse response;
// Improved loop structure with better attempt counting and clearer logic
for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
// Make the inner call (e.g., to the evaluation LLM model)
response = callAdvisorChain.copy(this).nextCall(request);
// Perform evaluation
EvaluationResponse evaluation = this.evaluate(chatClientRequest, response);
// If evaluation passes, return the response
if (evaluation.rating() >= this.successRating) {
logger.info("Evaluation passed on attempt {}, evaluation: {}", attempt, evaluation);
return response;
}
// If this is the last attempt, return the response regardless
if (attempt > maxRepeatAttempts) {
logger.warn(
"Maximum attempts ({}) reached. Returning last response despite failed evaluation. Use the following feedback to improve: {}",
maxRepeatAttempts, evaluation.feedback());
return response;
}
// Retry with evaluation feedback
logger.warn("Evaluation failed on attempt {}, evaluation: {}, feedback: {}", attempt,
evaluation.evaluation(), evaluation.feedback());
request = this.addEvaluationFeedback(chatClientRequest, evaluation);
}
// This should never be reached due to the loop logic above
throw new IllegalStateException("Unexpected loop exit in adviseCall");
}
/**
* Performs the evaluation using the LLM-as-a-Judge and returns the result.
*/
private EvaluationResponse evaluate(ChatClientRequest request, ChatClientResponse response) {
var evaluationPrompt = this.evaluationPromptTemplate.render(
Map.of("question", this.getPromptQuestion(request), "answer", this.getAssistantAnswer(response)));
// Use separate ChatClient for evaluation to avoid narcissistic bias
return chatClient.prompt(evaluationPrompt).call().entity(EvaluationResponse.class);
}
/**
* Creates a new request with evaluation feedback for retry.
*/
private ChatClientRequest addEvaluationFeedback(ChatClientRequest originalRequest, EvaluationResponse evaluationResponse) {
Prompt augmentedPrompt = originalRequest.prompt()
.augmentUserMessage(userMessage -> userMessage.mutate().text(String.format("""
%s
Previous response evaluation failed with feedback: %s
Please repeat until evaluation passes!
""", userMessage.getText(), evaluationResponse.feedback())).build());
return originalRequest.mutate().prompt(augmentedPrompt).build();
}
}
Recursive pattern implementation: The advisor uses callAdvisorChain.copy(this).nextCall(request) to create a sub-chain for each recursive call, enabling multiple evaluation rounds while preserving the correct advisor ordering.
Structured evaluation output: Using Spring AI's structured output support, the evaluation result is parsed into an EvaluationResponse record carrying the rating (1-4), the evaluation rationale, and specific improvement feedback (a sketch of the underlying conversion follows this list).
Separate judge model: A specialized LLM-as-a-Judge model (avcodes/flowaicom-flow-judge:q4) is used through a separate ChatClient instance to mitigate model bias. Set spring.ai.chat.client.enabled=false to enable the use of multiple chat models (a configuration sketch follows this list).
Feedback-driven improvement: Failed evaluations produce specific feedback that is folded into retry attempts, allowing the system to learn from evaluation failures.
Configurable retry logic: Supports a configurable maximum number of attempts and degrades gracefully when the evaluation limit is reached.
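Regarding the structured evaluation output: under the hood, .entity(EvaluationResponse.class) relies on Spring AI's output converters (BeanOutputConverter for records and beans) to append format instructions to the prompt and map the judge's JSON reply onto the record. A minimal sketch of that conversion (the JSON string below is illustrative, not real model output):
import org.springframework.ai.converter.BeanOutputConverter;

BeanOutputConverter<EvaluationResponse> converter = new BeanOutputConverter<>(EvaluationResponse.class);

// JSON-schema format instructions that the framework appends to the prompt
String formatInstructions = converter.getFormat();

// The judge model's raw JSON reply is converted back into the record,
// which is what .entity(EvaluationResponse.class) does for you
EvaluationResponse evaluation = converter.convert("""
        {"rating": 4, "evaluation": "Accurate and complete.", "feedback": "None."}
        """);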
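And regarding the separate judge model: with spring.ai.chat.client.enabled=false in application.properties, no ChatClient.Builder is auto-configured, so each client can be built explicitly from the desired ChatModel bean. A minimal sketch (the configuration class and bean names are illustrative; it assumes both the Anthropic and Ollama starters are on the classpath):
import org.springframework.ai.anthropic.AnthropicChatModel;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// application.properties: spring.ai.chat.client.enabled=false
@Configuration
class ChatClientConfig {

    @Bean
    ChatClient generationClient(AnthropicChatModel anthropicChatModel) {
        // Claude handles generation
        return ChatClient.builder(anthropicChatModel).build();
    }

    @Bean
    ChatClient judgeClient(OllamaChatModel ollamaChatModel) {
        // The local Ollama judge model evaluates responses
        return ChatClient.builder(ollamaChatModel).build();
    }
}
The demo application below achieves the same effect by building both clients inline from the injected chat models.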
Here is how to integrate the SelfRefineEvaluationAdvisor into a complete Spring AI application:
@SpringBootApplication
public class EvaluationAdvisorDemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(EvaluationAdvisorDemoApplication.class, args);
    }
@Bean
CommandLineRunner commandLineRunner(AnthropicChatModel anthropicChatModel, OllamaChatModel ollamaChatModel) {
return args -> {
ChatClient chatClient = ChatClient.builder(anthropicChatModel) // @formatter:off
.defaultTools(new MyTools())
.defaultAdvisors(
SelfRefineEvaluationAdvisor.builder()
.chatClientBuilder(ChatClient.builder(ollamaChatModel)) // Separate model for evaluation
.maxRepeatAttempts(15)
.successRating(4)
.order(0)
.build(),
new MyLoggingAdvisor(2))
.build();
var answer = chatClient
.prompt("What is current weather in Paris?")
.call()
.content();
System.out.println(answer);
};
}
static class MyTools {
final int[] temperatures = {-125, 15, -255};
private final Random random = new Random();
@Tool(description = "Get the current weather for a given location")
public String weather(String location) {
int temperature = temperatures[random.nextInt(temperatures.length)];
System.out.println(">>> Tool Call responseTemp: " + temperature);
return "The current weather in " + location + " is sunny with a temperature of " + temperature + "°C.";
}
}
}
This configuration uses Anthropic Claude for generation and Ollama for evaluation (to avoid bias), requires a rating of 4, and allows up to 15 retry attempts. It includes a weather tool that returns randomized responses to trigger the evaluation; the weather tool produces an invalid value 2 out of 3 times.
The SelfRefineEvaluationAdvisor (order 0) evaluates response quality and retries with feedback when needed, followed by MyLoggingAdvisor (order 2), which logs the final request/response for observability.
When you run it, you will see output like the following:
REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]
>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
>>> Tool Call responseTemp: 15
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data
RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.
🚀 Try it yourself: The complete runnable demo, with configuration examples covering different model combinations and evaluation scenarios, is available in the evaluation-recursive-advisor-demo project.
Spring AI's Recursive Advisors make the LLM-as-a-Judge pattern both elegant to implement and production-ready. The SelfRefineEvaluationAdvisor demonstrates how to build self-improving AI systems that automatically evaluate response quality, retry with feedback, and scale evaluation without human intervention.
Key benefits include automated quality control, bias mitigation through a separate judge model, and seamless integration with existing Spring AI applications. This approach provides a solid foundation for reliable, scalable quality assurance in chatbots, content generation, and complex AI workflows.
Key success factors when applying the LLM-as-a-Judge technique include a clear scoring rubric, a separate judge model to mitigate bias, feedback specific enough to drive improvement, and well-defined retry limits.
⚠️ Important notes
Recursive Advisors are an experimental new feature in Spring AI 1.1.0-M4+. They currently support only non-streaming calls, require careful advisor ordering, and can increase costs due to the additional LLM calls.
Be especially careful with inner advisors that maintain external state; they may need extra attention to remain correct across iterations.
Always set termination conditions and retry limits to prevent infinite loops.
Spring AI Documentation
LLM-as-a-Judge Research