docs: Enhance README with AI evaluation validation section, including detailed results and key findings from low-code platform certification exams

experdot · experdot · commit 7137e8c3ad51 · 2025-08-15T11:25:11.000+08:00
diff --git a/README.md b/README.md
@@ -25,6 +25,32 @@ GC-QA-RAG is an **enterprise-grade Retrieval-Augmented Generation (RAG) system**. We
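 -   🚀 **Ready-to-use solution**: Provides complete code from knowledge base construction (ETL) and backend services to the frontend interface, with one-click Docker deployment, helping developers quickly build their own high-quality RAG system.
 -   📚 **Comprehensive documentation and tutorials**: Provides documentation covering product design, technical architecture, and practical deployment experience. It is not just open-source code, but a reusable, practical methodology.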
 
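+## 🧪 AI Evaluation Validation
+
+To validate the system's real-world effectiveness, we had several mainstream large language models take the certification exams for the "活字格" (HuoZiGe) low-code platform. Each model was evaluated in three modes: direct answer generation, knowledge base retrieval (GC-QA-RAG), and Agent automatic planning retrieval (built on GC-QA-RAG).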
+
+**Evaluation Results Summary**:
+
+| Exam Subject                                           | Model               | Direct Generation | Knowledge Base Retrieval (RAG) | Agent Automatic Planning Retrieval | **Max Improvement** |
+| :----------------------------------------------------- | :------------------ | :---------------- | :----------------------------- | :--------------------------------- | :------------------ |
+| **Certified Engineer - Subject 1 (Fundamentals)**      | **Claude-4-sonnet** | 65.80%            | 81.03%                         | **88.51%**                         | +22.71%             |
+|                                                        | **GLM-4.5**         | 61.21%            | 84.20%                         | 87.07%                             | **+25.86%**         |
+|                                                        | **Qwen3**           | 67.82%            | 83.05%                         | 85.92%                             | +18.10%             |
+| **Certified Engineer - Subject 2 (Practice)**          | **Claude-4-sonnet** | 57.41%            | 69.44%                         | **70.37%**                         | +12.96%             |
+|                                                        | **GLM-4.5**         | 47.22%            | 64.81%                         | 65.74%                             | +18.52%             |
+|                                                        | **Qwen3**           | 51.85%            | 65.74%                         | 68.52%                             | +16.67%             |
+| **Advanced Certified Engineer - Subject 1 (Advanced)** | **Claude-4-sonnet** | 52.94%            | 65.88%                         | **74.12%**                         | +21.18%             |
+|                                                        | **GLM-4.5**         | 57.65%            | 67.06%                         | 68.24%                             | +10.59%             |
+|                                                        | **Qwen3**           | 54.12%            | 61.18%                         | 68.24%                             | +14.12%             |
+
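+**Key Findings**:
+
+-   **Agent mode delivers the highest scores**: For every model and exam, Agent automatic planning retrieval scored highest of the three modes, with a maximum gain of 25.86 percentage points over direct generation.
+-   **RAG significantly improves accuracy**: Every model scored substantially higher once it had access to the external knowledge base.
+-   **Claude-4-sonnet performs best overall**: It achieved the highest Agent-mode score in all three exam subjects.
+
+📖 **View the complete evaluation report**: [《让 LLM 做低代码考试，谁会胜出？》](./tools/gc-qa-rag-eval/让LLM做低代码考试谁会胜出.md) ("Who wins when LLMs take a low-code exam?")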
+
 ## 📖 Table of Contents
 
 -   [Quick Start](#-快速开始)
diff --git a/README_ENGLISH.md b/README_ENGLISH.md
@@ -21,6 +21,32 @@ GC-QA-RAG is an **enterprise-grade Retrieval-Augmented Generation (RAG) system**
 -   🚀 **Ready-to-use Solution**: Provides complete code from knowledge base construction (ETL), backend services to frontend interfaces, with Docker one-click deployment support, helping developers quickly build high-quality RAG systems.
 -   📚 **Comprehensive Documentation**: Offers complete documentation covering product design, technical architecture, and implementation experience - not just open-source code, but a reusable methodology.
 
+## 🧪 AI Evaluation Validation
+
+To validate the system's real-world effectiveness, we had three mainstream large language models (Claude-4-sonnet, GLM-4.5, and Qwen3) take the "HuoZiGe" low-code platform certification exams. Each model was evaluated in three modes: direct answer generation, knowledge base retrieval (GC-QA-RAG), and Agent automatic planning retrieval (built on GC-QA-RAG).
+
+**Evaluation Results Summary**:
+
+| Exam Subject                                           | Model               | Direct Generation | Knowledge Base Retrieval (RAG) | Agent Automatic Planning Retrieval | **Max Improvement** |
+| :----------------------------------------------------- | :------------------ | :---------------- | :----------------------------- | :--------------------------------- | :------------------ |
+| **Certified Engineer - Subject 1 (Fundamentals)**      | **Claude-4-sonnet** | 65.80%            | 81.03%                         | **88.51%**                         | +22.71%             |
+|                                                        | **GLM-4.5**         | 61.21%            | 84.20%                         | 87.07%                             | **+25.86%**         |
+|                                                        | **Qwen3**           | 67.82%            | 83.05%                         | 85.92%                             | +18.10%             |
+| **Certified Engineer - Subject 2 (Practice)**          | **Claude-4-sonnet** | 57.41%            | 69.44%                         | **70.37%**                         | +12.96%             |
+|                                                        | **GLM-4.5**         | 47.22%            | 64.81%                         | 65.74%                             | +18.52%             |
+|                                                        | **Qwen3**           | 51.85%            | 65.74%                         | 68.52%                             | +16.67%             |
+| **Advanced Certified Engineer - Subject 1 (Advanced)** | **Claude-4-sonnet** | 52.94%            | 65.88%                         | **74.12%**                         | +21.18%             |
+|                                                        | **GLM-4.5**         | 57.65%            | 67.06%                         | 68.24%                             | +10.59%             |
+|                                                        | **Qwen3**           | 54.12%            | 61.18%                         | 68.24%                             | +14.12%             |
+
+**Key Findings**:
+
+-   **Agent mode delivers the highest scores**: For every model and exam, Agent automatic planning retrieval scored highest of the three modes, with a maximum gain of 25.86 percentage points over direct generation (GLM-4.5, Certified Engineer - Subject 1).
+-   **RAG significantly improves accuracy**: Every model scored higher with knowledge base support than with direct generation, by 7.06 to 22.99 percentage points.
+-   **Claude-4-sonnet performs best overall**: It achieved the highest Agent-mode score in all three exam subjects.
+
+📖 **View the complete evaluation report**: [《让 LLM 做低代码考试，谁会胜出？》](./tools/gc-qa-rag-eval/让LLM做低代码考试谁会胜出.md) ("Who wins when LLMs take a low-code exam?")
+
 ## 📖 Table of Contents
 
 -   [Quick Start](#-quick-start)
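
The README text added in this diff does not define how the "Max Improvement" column is computed. The figures are consistent with reading it as the best assisted score (RAG or Agent) minus the direct-generation score, in percentage points; that reading is an assumption. The following minimal Python sketch recomputes the column and checks two of the key findings under that assumption. The scores are transcribed from the table above; the names `SCORES`, `max_improvement`, and `summarize` are illustrative and are not taken from the repository's evaluation tooling.

```python
# Scores transcribed from the results table in this diff (percent correct).
# Tuple order: (direct generation, RAG, Agent).
SCORES = {
    "Certified Engineer - Subject 1": {
        "Claude-4-sonnet": (65.80, 81.03, 88.51),
        "GLM-4.5": (61.21, 84.20, 87.07),
        "Qwen3": (67.82, 83.05, 85.92),
    },
    "Certified Engineer - Subject 2": {
        "Claude-4-sonnet": (57.41, 69.44, 70.37),
        "GLM-4.5": (47.22, 64.81, 65.74),
        "Qwen3": (51.85, 65.74, 68.52),
    },
    "Advanced Certified Engineer - Subject 1": {
        "Claude-4-sonnet": (52.94, 65.88, 74.12),
        "GLM-4.5": (57.65, 67.06, 68.24),
        "Qwen3": (54.12, 61.18, 68.24),
    },
}


def max_improvement(direct, rag, agent):
    """Best assisted score minus the direct baseline, in percentage points.

    Assumed definition: the README does not state the formula.
    """
    return round(max(rag, agent) - direct, 2)


def summarize(scores):
    """Return per-row gains plus the checks behind two key findings."""
    gains = {
        (exam, model): max_improvement(*row)
        for exam, models in scores.items()
        for model, row in models.items()
    }
    # Agent mode is never beaten by the other two modes in any row.
    agent_never_lower = all(
        row[2] >= max(row[0], row[1])
        for models in scores.values()
        for row in models.values()
    )
    # Model with the top Agent-mode score in each exam.
    agent_winners = {
        exam: max(models, key=lambda m: models[m][2])
        for exam, models in scores.items()
    }
    return gains, agent_never_lower, agent_winners


if __name__ == "__main__":
    gains, agent_never_lower, agent_winners = summarize(SCORES)
    for (exam, model), gain in gains.items():
        print(f"{exam} | {model}: +{gain:.2f} pp")
    print("Agent mode never lower:", agent_never_lower)
    print("Agent-mode winners:", agent_winners)
```

Keeping the raw scores in one structure and deriving the gain column from it means a transcription error in either README table would show up as a mismatch rather than go unnoticed.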