LLM Response Evaluation with Spring AI: Building LLM-as-a-Judge with Recursive Advisors

Engineering | Christian Tzolov | November 10, 2025 | ...

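Evaluating Large Language Model (LLM) outputs is a critical challenge for AI applications, which are notoriously non-deterministic, especially once they move into production.

Traditional metrics such as ROUGE and BLEU fall short when assessing the nuanced, context-dependent responses that modern LLMs produce. Human evaluation is accurate, but it is expensive, slow, and does not scale.

Enter LLM-as-a-Judge, a powerful technique that uses LLMs themselves to evaluate the quality of AI-generated content. Research shows that sophisticated judge models can reach up to 85% agreement with human judgments, which is higher than the agreement between humans themselves (81%).

In this article, we explore how Spring AI's Recursive Advisors provide an elegant framework for implementing the LLM-as-a-Judge pattern, enabling you to build self-improving AI systems with automated quality control. To learn more about the Recursive Advisors API, see our previous article: Create Self-Improving AI Agents Using Spring AI Recursive Advisors.

💡 Demo: Find the complete example implementation in evaluation-recursive-advisor-demo.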

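Understanding LLM-as-a-Judge

LLM-as-a-Judge is an evaluation method in which a Large Language Model assesses the quality of content generated by other models or by itself. Rather than relying solely on human evaluators or traditional automated metrics, it uses an LLM to score, classify, or compare responses against predefined criteria.

Why does it work? Evaluation is inherently easier than generation. When you use an LLM as a judge, you ask it to perform a simpler, more focused task (assessing specific properties of existing text) rather than the complex task of creating original content while balancing multiple constraints. A good analogy is that it is easier to critique than to create: spotting problems is simpler than preventing them.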

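There are two primary LLM-as-a-Judge evaluation patterns:

Direct assessment (pointwise scoring): the judge evaluates a single response and provides feedback that can be used to improve the prompt through self-refinement
Pairwise comparison: the judge selects the better of two candidate responses (common in A/B testing)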
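
As a rough sketch, the two patterns above can be expressed as plain Java types. The interfaces and the keyword-based stand-in judge below are hypothetical and are not part of Spring AI. A real judge would call an LLM, and a real pairwise judge would normally see both candidates in one prompt rather than scoring them separately.

```java
/** Hypothetical type for direct assessment; not part of Spring AI. */
interface PointwiseJudge {
    /** Scores one answer to a question on an integer scale (here 1-4). */
    int rate(String question, String answer);
}

/** Hypothetical type for pairwise comparison; not part of Spring AI. */
interface PairwiseJudge {
    /** Returns "A" or "B", whichever candidate answers the question better. */
    String prefer(String question, String answerA, String answerB);
}

final class JudgePatterns {

    /** Derives a pairwise judge from a pointwise one; ties go to candidate A. */
    static PairwiseJudge fromPointwise(PointwiseJudge judge) {
        return (question, a, b) ->
            judge.rate(question, a) >= judge.rate(question, b) ? "A" : "B";
    }

    /** Toy stand-in for an LLM call: rates 4 if the answer mentions the keyword, else 1. */
    static PointwiseJudge keywordJudge(String keyword) {
        return (question, answer) ->
            answer.toLowerCase().contains(keyword.toLowerCase()) ? 4 : 1;
    }
}
```

The rest of this article uses the pointwise form only, because a numeric rating gives the retry loop a simple pass/fail threshold.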

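LLM judges evaluate quality dimensions such as relevance, factual accuracy, faithfulness to sources, instruction adherence, and overall coherence and clarity, across domains such as healthcare, finance, RAG systems, and dialogue.

Choosing the Right Judge Model

While general-purpose models such as GPT-4 and Claude can serve as effective judges, dedicated LLM-as-a-Judge models consistently outperform them on evaluation tasks. The Judge Arena Leaderboard tracks how various models perform on judging tasks.

Spring AI: The Perfect Foundation

Spring AI's ChatClient provides a fluent API that is well suited to implementing the LLM-as-a-Judge pattern. Its Advisors system lets you intercept, modify, and enhance AI interactions in a modular, reusable way.

The recently introduced Recursive Advisors extend this further by enabling looping patterns, which are a natural fit for self-refining evaluation workflows: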

public class MyRecursiveAdvisor implements CallAdvisor {
    
    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
        
        // Call the chain initially
        ChatClientResponse response = chain.nextCall(request);
        
        // Check if we need to retry based on evaluation.
        // Simplified for illustration: a real advisor should also cap the number of retries.
        while (!evaluationPasses(response)) {

            // Modify the request based on evaluation feedback
            ChatClientRequest modifiedRequest = addEvaluationFeedback(request, response);
            
            // Create a sub-chain and recurse
            response = chain.copy(this).nextCall(modifiedRequest);
        }
        
        return response;
    }
}

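We will implement a SelfRefineEvaluationAdvisor that uses Spring AI's Recursive Advisors to embody the LLM-as-a-Judge pattern. The advisor automatically evaluates AI responses and retries failed attempts with feedback-driven improvement: generate a response → evaluate its quality → retry with feedback if needed → repeat until the quality threshold is met or the retry limit is reached.

Let's examine the implementation, which demonstrates advanced evaluation patterns:

SelfRefineEvaluationAdvisor Implementation

This implementation demonstrates the direct assessment pattern, in which a judge model evaluates individual responses using a pointwise scoring system (a 1-4 scale). It combines this with a self-refinement strategy that automatically retries failed evaluations by incorporating specific feedback into subsequent attempts, creating an iterative improvement loop.

The advisor embodies two key LLM-as-a-Judge concepts:

Pointwise evaluation: each response receives an individual quality score based on predefined criteria
Self-refinement: failed responses trigger retry attempts, with constructive feedback to guide improvement

(Based on the article: Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation)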

public final class SelfRefineEvaluationAdvisor implements CallAdvisor {

    private static final PromptTemplate DEFAULT_EVALUATION_PROMPT_TEMPLATE = new PromptTemplate(
        """
        You will be given a user_question and assistant_answer couple.
        Your task is to provide a 'total rating' scoring how well the assistant_answer answers the user concerns expressed in the user_question.
        Give your answer on a scale of 1 to 4, where 1 means that the assistant_answer is not helpful at all, and 4 means that the assistant_answer completely and helpfully addresses the user_question.

        Here is the scale you should use to build your answer:
        1: The assistant_answer is terrible: completely irrelevant to the question asked, or very partial
        2: The assistant_answer is mostly not helpful: misses some key aspects of the question
        3: The assistant_answer is mostly helpful: provides support, but still could be improved
        4: The assistant_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

        Provide your feedback as follows:

        \\{
            "rating": 0,
            "evaluation": "Explanation of the evaluation result and how to improve if needed.",
            "feedback": "Constructive and specific feedback on the assistant_answer."
        \\}

        Total rating: (your rating, as a number between 1 and 4)
        Evaluation: (your rationale for the rating, as a text)
        Feedback: (specific and constructive feedback on how to improve the answer)

        You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

        Now here are the question and answer.

        Question: {question}
        Answer: {answer}

        Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.

        Evaluation:
        """);

    @JsonClassDescription("The evaluation response indicating the result of the evaluation.")
    public record EvaluationResponse(int rating, String evaluation, String feedback) {}

    // Fields (chatClient, evaluationPromptTemplate, maxRepeatAttempts, successRating, logger),
    // the builder, getName() and getOrder() are omitted from this listing.

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
        var request = chatClientRequest;
        ChatClientResponse response;

        // One initial attempt plus up to maxRepeatAttempts retries
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {

            // Make the inner call through the rest of the chain (the generation model)
            response = callAdvisorChain.copy(this).nextCall(request);

            // Perform evaluation
            EvaluationResponse evaluation = this.evaluate(chatClientRequest, response);

            // If evaluation passes, return the response
            if (evaluation.rating() >= this.successRating) {
                logger.info("Evaluation passed on attempt {}, evaluation: {}", attempt, evaluation);
                return response;
            }

            // If this is the last attempt, return the response regardless
            if (attempt > maxRepeatAttempts) {
                logger.warn(
                    "Maximum attempts ({}) reached. Returning last response despite failed evaluation. Use the following feedback to improve: {}",
                    maxRepeatAttempts, evaluation.feedback());
                return response;
            }

            // Retry with evaluation feedback
            logger.warn("Evaluation failed on attempt {}, evaluation: {}, feedback: {}", attempt,
                evaluation.evaluation(), evaluation.feedback());

            request = this.addEvaluationFeedback(chatClientRequest, evaluation);
        }

        // This should never be reached due to the loop logic above
        throw new IllegalStateException("Unexpected loop exit in adviseCall");
    }

    /**
     * Performs the evaluation using the LLM-as-a-Judge and returns the result.
     */
    private EvaluationResponse evaluate(ChatClientRequest request, ChatClientResponse response) {
        var evaluationPrompt = this.evaluationPromptTemplate.render(
            Map.of("question", this.getPromptQuestion(request), "answer", this.getAssistantAnswer(response)));

        // Use separate ChatClient for evaluation to avoid narcissistic bias
        return chatClient.prompt(evaluationPrompt).call().entity(EvaluationResponse.class);
    }

    /**
     * Creates a new request with evaluation feedback for retry.
     */
    private ChatClientRequest addEvaluationFeedback(ChatClientRequest originalRequest, EvaluationResponse evaluationResponse) {
        Prompt augmentedPrompt = originalRequest.prompt()
            .augmentUserMessage(userMessage -> userMessage.mutate().text(String.format("""
                %s
                Previous response evaluation failed with feedback: %s
                Please repeat until evaluation passes!
                """, userMessage.getText(), evaluationResponse.feedback())).build());

        return originalRequest.mutate().prompt(augmentedPrompt).build();
    }
}

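Key Implementation Features

Recursive pattern implementation: The advisor uses callAdvisorChain.copy(this).nextCall(request) to create a sub-chain for each recursive call, which enables multiple evaluation rounds while maintaining the correct advisor ordering.

Structured evaluation output: Using Spring AI's structured output capabilities, the evaluation result is parsed into an EvaluationResponse record with a rating (1-4), an evaluation rationale, and specific feedback for improvement.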
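
One possible hardening of the structured output, shown here only as a sketch: the prompt's JSON example uses "rating": 0 as a placeholder, so a judge that echoes the example back would be read as a (failing) score rather than as a malformed reply. The CheckedEvaluation record below is a hypothetical variant of EvaluationResponse that rejects out-of-range ratings at construction time. How such a failure would surface during deserialization is not covered here.

```java
/** Hypothetical hardened variant of the article's EvaluationResponse record. */
record CheckedEvaluation(int rating, String evaluation, String feedback) {

    CheckedEvaluation {
        // Reject anything outside the 1-4 scale, including the 0 placeholder
        // that appears in the prompt's JSON example.
        if (rating < 1 || rating > 4) {
            throw new IllegalArgumentException("rating must be between 1 and 4, got " + rating);
        }
        // Normalise missing text so later string formatting never prints "null".
        evaluation = evaluation == null ? "" : evaluation;
        feedback = feedback == null ? "" : feedback;
    }

    /** True when the rating meets the configured threshold. */
    boolean passes(int successRating) {
        return rating >= successRating;
    }
}
```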

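Separate evaluation model: A dedicated LLM-as-a-Judge model (avcodes/flowaicom-flow-judge:q4) is used through a different ChatClient instance to mitigate model bias. Set spring.ai.chat.client.enabled=false to enable working with multiple chat models.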
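
A configuration along these lines would match the setup described above. This is an assumed application.properties fragment, not the demo's actual file: the environment variable name is a placeholder, and pinning the judge to temperature 0 is a choice made here for repeatable ratings.

```properties
# Disable the auto-configured ChatClient.Builder so that two ChatClient
# instances (generator and judge) can be built by hand from separate models.
spring.ai.chat.client.enabled=false

# Generation model (Anthropic). The environment variable name is an assumption.
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}

# Judge model served by Ollama, pinned to temperature 0 for repeatable ratings.
spring.ai.ollama.chat.options.model=avcodes/flowaicom-flow-judge:q4
spring.ai.ollama.chat.options.temperature=0.0
```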

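Feedback-driven improvement: Failed evaluations include specific feedback that is incorporated into the retry attempts, enabling the system to learn from evaluation failures.

Configurable retry logic: Supports a configurable maximum number of attempts, with graceful degradation when the evaluation limit is reached.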
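
The attempt accounting is easy to get wrong by one, so it can help to check it without any framework. The sketch below mirrors the advisor's loop using plain java.util.function stand-ins for the generation model and the judge. SelfRefineLoop, Result, and the feedback wording are hypothetical names introduced for illustration only.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Framework-free sketch of the advisor's attempt accounting; all names are hypothetical. */
final class SelfRefineLoop {

    /** The last answer, how many generations ran, and whether the judge accepted it. */
    record Result(String answer, int attempts, boolean passed) {}

    /**
     * @param generate stands in for the generation model: prompt -> answer
     * @param rate     stands in for the judge: (question, answer) -> integer rating
     */
    static Result run(String question,
                      Function<String, String> generate,
                      BiFunction<String, String, Integer> rate,
                      int successRating,
                      int maxRepeatAttempts) {
        if (maxRepeatAttempts < 0) {
            throw new IllegalArgumentException("maxRepeatAttempts must be >= 0");
        }
        String prompt = question;
        String answer = null;
        // One initial attempt plus up to maxRepeatAttempts retries.
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
            answer = generate.apply(prompt);
            // Always judge against the original question, not the augmented prompt.
            int rating = rate.apply(question, answer);
            if (rating >= successRating) {
                return new Result(answer, attempt, true);
            }
            // Feedback is appended to the original question rather than to the
            // previous retry prompt, so the prompt does not grow on every iteration.
            prompt = question + "\nPrevious response was rated " + rating + "; please improve it.";
        }
        // Graceful degradation: return the last answer even though it never passed.
        return new Result(answer, maxRepeatAttempts + 1, false);
    }
}
```

With maxRepeatAttempts set to 15, as in the demo configuration, this loop makes at most 16 generation calls and 16 judge calls.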

Integration

Here is how to integrate the SelfRefineEvaluationAdvisor into a complete Spring AI application:

@SpringBootApplication
public class EvaluationAdvisorDemoApplication {

    @Bean
    CommandLineRunner commandLineRunner(AnthropicChatModel anthropicChatModel, OllamaChatModel ollamaChatModel) {
        return args -> {
            
            ChatClient chatClient = ChatClient.builder(anthropicChatModel) // @formatter:off
                    .defaultTools(new MyTools())
                    .defaultAdvisors(
                        
                        SelfRefineEvaluationAdvisor.builder()
                            .chatClientBuilder(ChatClient.builder(ollamaChatModel)) // Separate model for evaluation
                            .maxRepeatAttempts(15)
                            .successRating(4)
                            .order(0)
                            .build(),
                        
                        new MyLoggingAdvisor(2))
                .build(); 
                
            var answer = chatClient
                .prompt("What is current weather in Paris?")
                .call()
                .content();

            System.out.println(answer);
        };
    }

    static class MyTools {
        final int[] temperatures = {-125, 15, -255};
        private final Random random = new Random();
        
        @Tool(description = "Get the current weather for a given location")
        public String weather(String location) {
            int temperature = temperatures[random.nextInt(temperatures.length)];
            System.out.println(">>> Tool Call responseTemp: " + temperature);
            return "The current weather in " + location + " is sunny with a temperature of " + temperature + "°C.";
        }
    }
}

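This configuration uses Anthropic Claude for generation and Ollama for evaluation (to avoid bias), requires a rating of 4, and allows up to 15 retries. It includes a weather tool that produces randomized responses to trigger evaluations: the weather tool returns an invalid value in 2 out of 3 cases.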
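
To get a feel for how many retries this setup needs, here is a back-of-the-envelope calculation. It assumes that every attempt calls the tool exactly once and that the judge rejects exactly the invalid readings; neither is guaranteed with real models. RetryOdds is a hypothetical helper.

```java
/** Rough odds for the demo, under the simplifying assumptions stated above. */
final class RetryOdds {

    /** Probability that every one of the given attempts sees an invalid reading. */
    static double allAttemptsFail(double invalidProbability, int totalAttempts) {
        return Math.pow(invalidProbability, totalAttempts);
    }

    /** Expected attempts until the first valid reading (geometric distribution), ignoring the retry cap. */
    static double expectedAttempts(double invalidProbability) {
        return 1.0 / (1.0 - invalidProbability);
    }
}
```

With an invalid reading in 2 out of 3 cases and 16 total attempts (1 initial plus 15 retries), allAttemptsFail(2.0 / 3.0, 16) is about 0.0015, and expectedAttempts(2.0 / 3.0) is about 3. Under these assumptions the retry limit is rarely hit.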

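The SelfRefineEvaluationAdvisor (order 0) evaluates response quality and retries with feedback when needed, followed by MyLoggingAdvisor (order 2), which logs the final request/response for observability.

When you run it, you will see output like this: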

REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]

>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
 
>>> Tool Call responseTemp: 15  
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data

RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.

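🚀 Try it yourself: The complete runnable demo, with configuration examples (including different model combinations and evaluation scenarios), is available in the evaluation-recursive-advisor-demo project.

Conclusion

Spring AI's Recursive Advisors make implementing the LLM-as-a-Judge pattern both elegant and production-ready. The SelfRefineEvaluationAdvisor demonstrates how to build self-improving AI systems that automatically evaluate response quality, retry with feedback, and scale evaluation without human intervention.

Key benefits include automated quality control, bias mitigation through a separate judge model, and seamless integration with existing Spring AI applications. This approach provides a foundation for reliable, scalable quality assurance across chatbots, content generation, and complex AI workflows.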

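Key success factors when implementing the LLM-as-a-Judge technique include:

Use a dedicated judge model for better performance (see the Judge Arena Leaderboard)
Mitigate bias by using separate generation and evaluation models
Ensure deterministic results (temperature = 0)
Design prompts with integer scales and few-shot examples
Maintain human oversight for high-stakes decisions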
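
As an illustration of the prompt-design point in the list above, the sketch below assembles a judge prompt that states an integer scale and includes few-shot examples. JudgePromptBuilder, FewShot, and the prompt wording are hypothetical; the article's own advisor uses the longer template shown earlier and does not include few-shot examples.

```java
import java.util.List;

/** Hypothetical helper that assembles a judge prompt; the wording is illustrative only. */
final class JudgePromptBuilder {

    /** One worked example shown to the judge before the real question. */
    record FewShot(String question, String answer, int rating) {}

    static String build(String question, String answer, List<FewShot> examples) {
        StringBuilder sb = new StringBuilder();
        sb.append("Rate how well the answer addresses the question.\n");
        sb.append("Reply with a single integer from 1 (not helpful) to 4 (excellent).\n\n");
        for (FewShot ex : examples) {
            sb.append("Question: ").append(ex.question()).append('\n');
            sb.append("Answer: ").append(ex.answer()).append('\n');
            sb.append("Rating: ").append(ex.rating()).append("\n\n");
        }
        // The real pair goes last and ends on "Rating:" so the judge completes it with a number.
        sb.append("Question: ").append(question).append('\n');
        sb.append("Answer: ").append(answer).append('\n');
        sb.append("Rating:");
        return sb.toString();
    }
}
```

An integer scale with a short description per level tends to be easier to parse and compare across runs than free-form or fractional scores, which is the motivation for the design.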

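⚠️ Important Note

Recursive Advisors are a new experimental feature in Spring AI 1.1.0-M4+. Currently they support non-streaming calls only, require careful advisor ordering, and can increase costs because of the multiple LLM calls.

Be especially careful with inner advisors that maintain external state: they may require extra attention to remain correct across iterations.

Always set termination conditions and retry limits to prevent infinite loops.

Resources

Spring AI Documentation

LLM-as-a-Judge Research

Judge Arena Leaderboard - Current rankings of the best-performing judge models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - The foundational paper that introduced the LLM-as-a-Judge paradigm
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement - Introduces a two-step benchmark that evaluates 54 LLMs as judges by testing their correlation with human judgment and their agreement patterns, revealing that 27 models reach top-tier performance, regardless of size, through human-like or super-consistent judging behavior
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge (2024) - A survey covering the full LLM-as-a-Judge landscape, with a systematic taxonomy and current challenges
LLM-as-a-Judge Resource Hub - A central repository with paper lists, tools, and ongoing research
Preference Leakage: A Contamination Problem in LLM-as-a-judge - Recent research on judge model bias
Who's Your Judge? On the Detectability of LLM-Generated Judgments - Emerging research on judgment detection and transparency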
