RAG Performance Optimization

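As a RAG system moves from demo to production, performance is often the biggest bottleneck. Golang's concurrency model can help a great deal here. This chapter introduces several core optimization strategies.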

1. Semantic Caching

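LLM calls are both expensive and slow. If a user asks a question similar to one that was answered before, we should return the earlier answer directly instead of running the whole RAG pipeline again.

A traditional cache (Key = query string) has a very low hit rate, because users phrase the same question slightly differently each time. A semantic cache (Key = query embedding) finds hits by computing vector similarity.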

Golang Implementation Approach

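Compute the embedding vector of the query.
Look up historical questions with similarity > 0.95 in Redis (with vector support) or in a local cache.
On a hit, return the stored answer directly.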

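package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// SemanticCache is a thin wrapper around a Redis-backed semantic cache.
type SemanticCache struct {
	rdb *redis.Client
}

// Get looks up the stored answer whose query vector is closest to queryVec.
// The vector search itself is not implemented here.
func (c *SemanticCache) Get(ctx context.Context, queryVec []float32) (string, bool) {
	// Assumes Redis has the RediSearch module installed.
	// The vector search is shown as pseudocode only:
	// CMD: FT.SEARCH idx "*=>[KNN 1 @vector $vec AS score]" RETURN 1 answer
	// A real implementation must also compare the returned score with the threshold.

	// Simulate a hit
	return "cached answer", true
}

// Set stores the answer under its query vector.
func (c *SemanticCache) Set(ctx context.Context, queryVec []float32, answer string) {
	// Write to Redis
}

// cache and embed are placeholders so that the middleware compiles.
// In practice, wire in a real Redis client and an embedding model.
var cache = &SemanticCache{}

func embed(query string) []float32 {
	return nil // placeholder: call the embedding model here
}

// CacheMiddleware wraps a RAG handler with a semantic cache lookup.
func CacheMiddleware(next func(string) string) func(string) string {
	return func(query string) string {
		ctx := context.Background()

		// 1. Compute the query vector
		vec := embed(query)

		// 2. Check the cache
		if ans, hit := cache.Get(ctx, vec); hit {
			fmt.Println("Semantic cache hit!")
			return ans
		}

		// 3. Run the original logic
		ans := next(query)

		// 4. Write to the cache (asynchronously)
		go cache.Set(ctx, vec, ans)

		return ans
	}
}

func main() {
	handler := CacheMiddleware(func(q string) string { return "fresh answer for: " + q })
	fmt.Println(handler("What is RAG?"))
}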
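
The Redis lookup above is left as pseudocode. As a complement, here is a minimal sketch of the "local cache" option: an in-process cache that compares query vectors by cosine similarity and only counts matches above the 0.95 threshold as hits. The names `MemoryCache`, `entry`, and `cosine` are hypothetical, and the linear scan assumes a small number of entries.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// entry pairs a stored query vector with its cached answer.
type entry struct {
	vec    []float32
	answer string
}

// MemoryCache is a hypothetical in-process semantic cache.
// It scans all entries linearly, which only suits small caches.
type MemoryCache struct {
	mu        sync.RWMutex
	entries   []entry
	threshold float64 // a match must exceed this cosine similarity
}

// cosine returns the cosine similarity of a and b, or 0 when the
// lengths differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Get returns the answer of the most similar stored query, provided
// its similarity is strictly above the threshold.
func (c *MemoryCache) Get(vec []float32) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	best, bestSim, hit := "", c.threshold, false
	for _, e := range c.entries {
		if sim := cosine(vec, e.vec); sim > bestSim {
			best, bestSim, hit = e.answer, sim, true
		}
	}
	return best, hit
}

// Set stores an answer under its query vector.
func (c *MemoryCache) Set(vec []float32, answer string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = append(c.entries, entry{vec: vec, answer: answer})
}

func main() {
	c := &MemoryCache{threshold: 0.95}
	c.Set([]float32{1, 0, 0}, "answer A")

	// Nearly the same direction (similarity about 0.995): a hit.
	ans, hit := c.Get([]float32{1, 0.1, 0})
	fmt.Println(ans, hit)

	// Orthogonal vector (similarity 0): a miss.
	ans, hit = c.Get([]float32{0, 1, 0})
	fmt.Println(ans, hit)
}
```

A cache like this has no eviction or expiry, so stale answers stay forever; a production version would likely need a TTL and a size limit.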

2. Async Ingestion

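When a user uploads a 100MB PDF, running Parsing -> Chunking -> Embedding -> Indexing synchronously will make the request time out. The work has to be processed asynchronously, using the Worker Pool pattern.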

package main

import "fmt"

// Job describes one file to ingest.
type Job struct {
	FilePath string
}

// Worker pulls jobs from the queue until it is closed and reports each result.
func Worker(id int, jobs <-chan Job, results chan<- string) {
	for j := range jobs {
		fmt.Printf("Worker %d started processing file: %s\n", id, j.FilePath)
		// Simulate the slow steps: Load -> Split -> Embed -> Store
		processFile(j.FilePath)
		results <- fmt.Sprintf("File %s processed", j.FilePath)
	}
}

func processFile(path string) {
	// RAG indexing logic...
}

func main() {
	jobs := make(chan Job, 100)
	results := make(chan string, 100)

	// Start 5 workers
	for w := 1; w <= 5; w++ {
		go Worker(w, jobs, results)
	}

	// Submit jobs
	jobs <- Job{FilePath: "doc1.pdf"}
	jobs <- Job{FilePath: "doc2.pdf"}
	close(jobs)

	// Collect one result per submitted job
	for a := 1; a <= 2; a++ {
		fmt.Println(<-results)
	}
}
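
To keep an upload endpoint from timing out, the caller needs an immediate acknowledgement plus a way to check progress later. The sketch below is one possible shape for that, not a definitive design: `Submit` returns a job ID right away (or reports that the queue is full), workers update a status table, and `Status` can back a polling endpoint. `Ingestor`, `Submit`, `Status`, and `Close` are hypothetical names, and `processFile` is still a placeholder.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is the lifecycle state of an ingestion job.
type Status string

const (
	StatusPending Status = "pending"
	StatusDone    Status = "done"
)

// Job describes one file to ingest.
type Job struct {
	ID       int
	FilePath string
}

// Ingestor is a hypothetical async ingestion queue with a status table.
type Ingestor struct {
	mu     sync.Mutex
	nextID int
	status map[int]Status
	jobs   chan Job
	wg     sync.WaitGroup
}

// NewIngestor starts the given number of workers on a bounded queue.
func NewIngestor(workers, queueSize int) *Ingestor {
	in := &Ingestor{status: make(map[int]Status), jobs: make(chan Job, queueSize)}
	for w := 0; w < workers; w++ {
		in.wg.Add(1)
		go in.worker()
	}
	return in
}

func (in *Ingestor) worker() {
	defer in.wg.Done()
	for j := range in.jobs {
		processFile(j.FilePath)
		in.mu.Lock()
		in.status[j.ID] = StatusDone
		in.mu.Unlock()
	}
}

// Submit enqueues a file without blocking. It returns false when the
// queue is full, so the caller can ask the client to retry later.
// It must not be called after Close.
func (in *Ingestor) Submit(path string) (int, bool) {
	in.mu.Lock()
	in.nextID++
	id := in.nextID
	in.status[id] = StatusPending
	in.mu.Unlock()

	select {
	case in.jobs <- Job{ID: id, FilePath: path}:
		return id, true
	default:
		in.mu.Lock()
		delete(in.status, id)
		in.mu.Unlock()
		return 0, false
	}
}

// Status reports the state of a job, if the ID is known.
func (in *Ingestor) Status(id int) (Status, bool) {
	in.mu.Lock()
	defer in.mu.Unlock()
	st, ok := in.status[id]
	return st, ok
}

// Close stops accepting jobs and waits for the workers to drain the queue.
func (in *Ingestor) Close() {
	close(in.jobs)
	in.wg.Wait()
}

func processFile(path string) {
	// Placeholder for Load -> Split -> Embed -> Store.
}

func main() {
	in := NewIngestor(5, 100)
	id, ok := in.Submit("doc1.pdf")
	fmt.Println("accepted:", ok)

	in.Close()
	st, _ := in.Status(id)
	fmt.Println("final status:", st)
}
```

In an HTTP service, the upload handler would call `Submit` and reply immediately with the job ID, while a separate endpoint would expose `Status`. The in-memory status table is lost on restart, so a durable queue may be a better fit for real workloads.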

3. Context Compression

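Retrieved documents can be long. Feeding them to the LLM as-is consumes a lot of tokens and introduces noise. We can use a small model (such as BERT) or simple rules to compress the retrieval results.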

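Strategies:

LLMLingua: uses a dedicated model to compress the prompt.
Keyword filtering: keeps only the sentences that contain keywords from the query.

A simple Golang implementation (based on sentence filtering):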

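import "strings"

// ExtractKeywords is a placeholder that splits the query on whitespace.
// Replace it with a real keyword extractor (for Chinese text, a word segmenter).
func ExtractKeywords(query string) []string {
	return strings.Fields(query)
}

// CompressContext keeps only the sentences that contain at least one query keyword.
func CompressContext(query string, docs []string) []string {
	keywords := ExtractKeywords(query) // extract keywords
	var compressed []string

	for _, doc := range docs {
		// Split on the Chinese full stop; change the delimiter for other languages.
		sentences := strings.Split(doc, "。")
		for _, sent := range sentences {
			for _, kw := range keywords {
				if strings.Contains(sent, kw) {
					compressed = append(compressed, sent)
					break
				}
			}
		}
	}
	return compressed
}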
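
Keyword filtering alone does not bound how much text reaches the LLM. One possible refinement, shown here as a sketch under stated assumptions, ranks sentences by how many keywords they contain and keeps the best ones within a size budget. `CompressWithBudget` and `scored` are hypothetical names; the budget is counted in characters (runes) as a rough stand-in for tokens, and sentences are split on "." for English sample text.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// scored is a candidate sentence with its keyword hit count and
// its position in the original reading order.
type scored struct {
	text  string
	score int
	order int
}

// CompressWithBudget keeps the highest-scoring sentences whose total
// length fits in budget runes, then restores the original order.
func CompressWithBudget(keywords, docs []string, budget int) []string {
	var cands []scored
	for _, doc := range docs {
		for _, sent := range strings.Split(doc, ".") {
			sent = strings.TrimSpace(sent)
			if sent == "" {
				continue
			}
			lower := strings.ToLower(sent)
			hits := 0
			for _, kw := range keywords {
				if kw != "" && strings.Contains(lower, strings.ToLower(kw)) {
					hits++
				}
			}
			if hits > 0 {
				cands = append(cands, scored{sent, hits, len(cands)})
			}
		}
	}

	// Highest score first; the stable sort keeps reading order among ties.
	sort.SliceStable(cands, func(i, j int) bool { return cands[i].score > cands[j].score })

	var kept []scored
	used := 0
	for _, c := range cands {
		size := utf8.RuneCountInString(c.text)
		if used+size > budget {
			continue // does not fit; a shorter sentence later still might
		}
		used += size
		kept = append(kept, c)
	}

	// Restore the original reading order for the prompt.
	sort.Slice(kept, func(i, j int) bool { return kept[i].order < kept[j].order })
	out := make([]string, len(kept))
	for i, c := range kept {
		out[i] = c.text
	}
	return out
}

func main() {
	docs := []string{
		"Go has goroutines. Redis supports vector search. The weather is nice.",
		"Vector search in Redis needs an index.",
	}
	for _, s := range CompressWithBudget([]string{"redis", "vector"}, docs, 80) {
		fmt.Println(s)
	}
}
```

With a budget of 80 both matching sentences fit; with a smaller budget only the first would be kept. A real system would count tokens with the target model's tokenizer rather than characters.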

4. Summary

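Golang's core value in RAG optimization:

High concurrency: handles concurrent indexing of large document sets with ease.
Low latency: as a compiled language, it keeps API response times short.
Strong typing: complex pipeline code is easier to maintain.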
