Commit 7137e8c

docs: Enhance README with AI evaluation validation section, including detailed results and key findings from low-code platform certification exams
1 parent 79e2d81 commit 7137e8c

2 files changed: +52 -0 lines changed

README.md

Lines changed: 26 additions & 0 deletions
@@ -25,6 +25,32 @@ GC-QA-RAG 是一个**企业级的检索增强生成(RAG)系统**。我们通
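 - 🚀 **Ready-to-use Solution**: Provides complete code from knowledge base construction (ETL) and backend services to the frontend interface, with Docker one-click deployment support, helping developers quickly build their own high-quality RAG systems.
 - 📚 **Comprehensive Documentation**: Offers complete documentation covering product design, technical architecture, and implementation experience. It is not just open-source code, but a reusable, practical methodology.
 
+## 🧪 AI Evaluation Validation
+
+To validate the system's actual effectiveness, we had multiple mainstream large language models take the certification exams for the "活字格" (HuoZiGe) low-code platform. The evaluation used three modes: direct answer generation, knowledge base retrieval (GC-QA-RAG), and Agent automatic planning retrieval (based on GC-QA-RAG).
+
+**Evaluation Results Summary**
+
+| Exam Subject                                           | Model               | Direct Generation | Knowledge Base Retrieval (RAG) | Agent Automatic Planning Retrieval | **Max Improvement** |
+| :----------------------------------------------------- | :------------------ | :---------------- | :----------------------------- | :--------------------------------- | :------------------ |
+| **Certified Engineer - Subject 1 (Fundamentals)**      | **Claude-4-sonnet** | 65.80%            | 81.03%                         | **88.51%**                         | +22.71%             |
+|                                                        | **GLM-4.5**         | 61.21%            | 84.20%                         | 87.07%                             | **+25.86%**         |
+|                                                        | **Qwen3**           | 67.82%            | 83.05%                         | 85.92%                             | +18.10%             |
+| **Certified Engineer - Subject 2 (Practice)**          | **Claude-4-sonnet** | 57.41%            | 69.44%                         | **70.37%**                         | +12.96%             |
+|                                                        | **GLM-4.5**         | 47.22%            | 64.81%                         | 65.74%                             | +18.52%             |
+|                                                        | **Qwen3**           | 51.85%            | 65.74%                         | 68.52%                             | +16.67%             |
+| **Advanced Certified Engineer - Subject 1 (Advanced)** | **Claude-4-sonnet** | 52.94%            | 65.88%                         | **74.12%**                         | +21.18%             |
+|                                                        | **GLM-4.5**         | 57.65%            | 67.06%                         | 68.24%                             | +10.59%             |
+|                                                        | **Qwen3**           | 54.12%            | 61.18%                         | 68.24%                             | +14.12%             |
+
+**Key Findings**
+
+- **Agent mode shows the most significant results**: In all tests, the Agent automatic planning retrieval mode achieved the highest score, with a maximum improvement of 25.86%
+- **RAG technology significantly improves accuracy**: All models scored substantially higher once they had external knowledge base support
+- **Claude-4-sonnet shows the best overall performance**: It achieved the highest score in Agent mode in all three subjects
+
+📖 **View Complete Evaluation Report**[《让 LLM 做低代码考试,谁会胜出?》](./tools/gc-qa-rag-eval/让LLM做低代码考试谁会胜出.md)
+
 ## 📖 Table of Contents
 
 - [Quick Start](#-快速开始)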

README_ENGLISH.md

Lines changed: 26 additions & 0 deletions
@@ -21,6 +21,32 @@ GC-QA-RAG is an **enterprise-grade Retrieval-Augmented Generation (RAG) system**
 - 🚀 **Ready-to-use Solution**: Provides complete code from knowledge base construction (ETL), backend services to frontend interfaces, with Docker one-click deployment support, helping developers quickly build high-quality RAG systems.
 - 📚 **Comprehensive Documentation**: Offers complete documentation covering product design, technical architecture, and implementation experience - not just open-source code, but a reusable methodology.
 
+## 🧪 AI Evaluation Validation
+
+To validate the system's actual effectiveness, we had multiple mainstream large language models participate in the "HuoZiGe" low-code platform certification exams. The evaluation used three modes: direct answer generation, knowledge base retrieval (GC-QA-RAG), and Agent automatic planning retrieval (based on GC-QA-RAG).
+
+**Evaluation Results Summary**:
+
+| Exam Subject                                           | Model               | Direct Generation | Knowledge Base Retrieval (RAG) | Agent Automatic Planning Retrieval | **Max Improvement** |
+| :----------------------------------------------------- | :------------------ | :---------------- | :----------------------------- | :--------------------------------- | :------------------ |
+| **Certified Engineer - Subject 1 (Fundamentals)**      | **Claude-4-sonnet** | 65.80%            | 81.03%                         | **88.51%**                         | +22.71%             |
+|                                                        | **GLM-4.5**         | 61.21%            | 84.20%                         | 87.07%                             | **+25.86%**         |
+|                                                        | **Qwen3**           | 67.82%            | 83.05%                         | 85.92%                             | +18.10%             |
+| **Certified Engineer - Subject 2 (Practice)**          | **Claude-4-sonnet** | 57.41%            | 69.44%                         | **70.37%**                         | +12.96%             |
+|                                                        | **GLM-4.5**         | 47.22%            | 64.81%                         | 65.74%                             | +18.52%             |
+|                                                        | **Qwen3**           | 51.85%            | 65.74%                         | 68.52%                             | +16.67%             |
+| **Advanced Certified Engineer - Subject 1 (Advanced)** | **Claude-4-sonnet** | 52.94%            | 65.88%                         | **74.12%**                         | +21.18%             |
+|                                                        | **GLM-4.5**         | 57.65%            | 67.06%                         | 68.24%                             | +10.59%             |
+|                                                        | **Qwen3**           | 54.12%            | 61.18%                         | 68.24%                             | +14.12%             |
+
+**Key Findings**:
+
+- **Agent mode shows the most significant results**: In all tests, Agent automatic planning retrieval mode achieved the highest scores, with maximum improvement of 25.86%
+- **RAG technology significantly improves accuracy**: All models showed substantial improvement after gaining external knowledge base support
+- **Claude-4-sonnet demonstrates the best overall performance**: Achieved the highest scores in Agent mode across all three subjects
+
+📖 **View Complete Evaluation Report**: [《让 LLM 做低代码考试,谁会胜出?》](./tools/gc-qa-rag-eval/让LLM做低代码考试谁会胜出.md)
+
 ## 📖 Table of Contents
 
 - [Quick Start](#-quick-start)
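
Review note: the commit does not state how the "Max Improvement" column is derived. Reading the table, it appears to be the higher of the two retrieval-assisted scores minus the direct-generation score, in percentage points; that formula is an assumption, not something the README says. The Python sketch below rechecks the column and the three key findings under that assumption. The scores are transcribed from the table in this diff, and the helper names (`max_improvement`, `mismatched_rows`, `agent_always_highest`, `top_agent_model_by_subject`) are hypothetical and not part of GC-QA-RAG.

```python
# Scores transcribed from the table in this diff:
# (subject, model, direct, rag, agent, reported max improvement)
ROWS = [
    ("Certified Engineer - Subject 1", "Claude-4-sonnet", 65.80, 81.03, 88.51, 22.71),
    ("Certified Engineer - Subject 1", "GLM-4.5", 61.21, 84.20, 87.07, 25.86),
    ("Certified Engineer - Subject 1", "Qwen3", 67.82, 83.05, 85.92, 18.10),
    ("Certified Engineer - Subject 2", "Claude-4-sonnet", 57.41, 69.44, 70.37, 12.96),
    ("Certified Engineer - Subject 2", "GLM-4.5", 47.22, 64.81, 65.74, 18.52),
    ("Certified Engineer - Subject 2", "Qwen3", 51.85, 65.74, 68.52, 16.67),
    ("Advanced Certified Engineer - Subject 1", "Claude-4-sonnet", 52.94, 65.88, 74.12, 21.18),
    ("Advanced Certified Engineer - Subject 1", "GLM-4.5", 57.65, 67.06, 68.24, 10.59),
    ("Advanced Certified Engineer - Subject 1", "Qwen3", 54.12, 61.18, 68.24, 14.12),
]


def max_improvement(direct, rag, agent):
    """Assumed formula: best retrieval-assisted score minus the direct score (percentage points)."""
    return round(max(rag, agent) - direct, 2)


def mismatched_rows(rows, tolerance=0.005):
    """Return (subject, model) pairs whose reported gain disagrees with the assumed formula."""
    return [
        (subject, model)
        for subject, model, direct, rag, agent, reported in rows
        if abs(max_improvement(direct, rag, agent) - reported) > tolerance
    ]


def agent_always_highest(rows):
    """True if every row satisfies agent > rag > direct."""
    return all(agent > rag > direct for _, _, direct, rag, agent, _ in rows)


def top_agent_model_by_subject(rows):
    """Map each subject to the model with the highest Agent-mode score."""
    best = {}
    for subject, model, _, _, agent, _ in rows:
        if subject not in best or agent > best[subject][1]:
            best[subject] = (model, agent)
    return {subject: model for subject, (model, _) in best.items()}


if __name__ == "__main__":
    print("mismatched rows:", mismatched_rows(ROWS))
    print("agent highest in every row:", agent_always_highest(ROWS))
    print("top agent-mode model per subject:", top_agent_model_by_subject(ROWS))
```

With the figures as transcribed, all nine rows agree with the assumed formula, the Agent score is the highest in every row, and Claude-4-sonnet has the top Agent-mode score in each subject, which matches the key findings. Since the gains are differences between percentages, "+25.86%" is best read as 25.86 percentage points rather than a relative increase.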
