(1) Adding Prior Information to ChatGLM: Using text2vec to Add Prior Information

Author: 我家小花儿 | 2024-03-17 09:45:36


Preface

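When we ask ChatGPT a question, the answers draw on general knowledge, and knowledge from our own private domain is not handled well. There are already some ways to attach private-domain knowledge to ChatGPT as prior information, but OpenAI is not open: any data you send it may be collected by OpenAI and used as material for later training. Since this is private-domain knowledge, we certainly do not want OpenAI to record it, so here we use a locally deployed ChatGLM instead.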

The result looks like this:

Setting up ChatGLM itself is not the focus of this article; for details, see the following tutorial:
GitHub - THUDM/ChatGLM-6B: ChatGLM-6B: An Open Bilingual Dialogue Language Model

Approach

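Adding prior knowledge to ChatGLM
The approach is simple: vectorize both the source text and the text you want to search for, compare the passages by their similarity in vector space, and pick the few passages with the highest similarity. The retrieved information is then passed to the model in the form of history.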
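
The ranking step described above can be sketched without any model at all. The snippet below is only an illustration under stated assumptions: the vectors are hand-written toy values rather than real text2vec embeddings, and the helper names `cosine_similarity`, `top_k_passages` and `build_history` are hypothetical, not part of text2vec or ChatGLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_passages(query_vec, passage_vecs, passages, k=1):
    """Return the k passages whose vectors are most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(passage_vecs, passages)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_history(query, retrieved):
    """Pair each retrieved passage with the query, in the (question, answer) shape used for history."""
    return [(query, text) for text in retrieved]


# Toy data: these vectors are made up for illustration, not produced by an embedding model.
passages = ["passage A", "passage B", "passage C"]
passage_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
query_vec = [0.1, 0.9, 0.0]

best = top_k_passages(query_vec, passage_vecs, passages, k=2)
history = build_history("example question", best)
print(history)
```

In the real script, the vectors come from the text2vec sentence encoder and the search itself is delegated to docarray, but the idea is the same: score every passage against the query and keep the best few.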

Code

from transformers import AutoTokenizer, AutoModel

# Load the INT4-quantized ChatGLM-6B checkpoint from a local directory (Windows-style path).
# To pull the weights from the Hugging Face Hub instead, use "THUDM/chatglm-6b-int4" as the path.
model_path = ".\\models\\chatglm-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()

# Optional: pass a precompiled quantization kernel explicitly.
# kernel_file = "./models/chatglm-6b-int4/quantization_kernels.so"
# model = model.quantize(bits=4, kernel_file=kernel_file)
model = model.quantize(bits=4)
model = model.eval()

def parse_text(text):
    """Convert model output containing Markdown code fences into HTML for display in a chat UI."""
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split('`')
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f'<br></code></pre>'
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace("`", "\\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>"+line
    text = "".join(lines)
    return text

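def predict(query, chatbot, max_length, top_p, temperature, history):
    """Stream a reply from ChatGLM, yielding the updated chat log and history after each chunk."""
    chatbot.append((parse_text(query), ""))
    for response, history in model.stream_chat(tokenizer, query, history, max_length=max_length, top_p=top_p,
                                               temperature=temperature):
        # stream_chat yields the cumulative response so far, so overwrite the last entry each time.
        chatbot[-1] = (parse_text(query), parse_text(response))
        yield chatbot, history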

def text2ver_search(file_name, search_text, limit=1):
    """Semantic search over a single file with text2vec; returns (search_text, passage) pairs."""
    # Note: Document, DocumentArray and Document.match belong to the legacy docarray API;
    # newer docarray releases changed this interface, so an older version may be required.
    from docarray import Document, DocumentArray
    from text2vec import SentenceModel, EncoderType
    from tqdm import tqdm

    with open(file_name, encoding='utf-8') as f:
        txt = f.read()

    # Split the file on newlines; each non-empty line becomes one candidate passage.
    document_array = DocumentArray(
        Document(text=s.strip()) for s in txt.split('\n') if s.strip())
    # Named `encoder` so it does not shadow the global ChatGLM `model`.
    encoder = SentenceModel("shibing624/text2vec-base-chinese", encoder_type=EncoderType.FIRST_LAST_AVG, device='cpu')
    for document in tqdm(document_array):
        document.embedding = encoder.encode(document.text)

    query = Document(text=search_text)  # the text to match against
    query.embedding = encoder.encode(query.text)
    # Find the `limit` passages most similar to the query text.
    result = query.match(document_array, limit=limit, exclude_self=True, metric='cos', use_scipy=True)
    matched_texts = result.matches[:, 'text']

    # Pair each retrieved passage with the query so it can be passed to ChatGLM as history.
    return [(search_text, matched_text) for matched_text in matched_texts]


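file_name = './data/test.txt'
search_text = '安心的老婆是谁？'  # "Who is An Xin's wife?"
querys_list = text2ver_search(file_name, search_text, 1)
print("querys_list:", querys_list)

# Pass the retrieved (question, passage) pairs to ChatGLM as chat history.
history = querys_list
response_old = ''
for chatbot, history in predict(search_text, chatbot=[], max_length=10000, top_p=0.5, temperature=0.5, history=history):
    response_new = chatbot[0][1]
    # Each step yields the full response so far; print only the newly generated tail.
    if response_new.startswith(response_old):
        print(response_new[len(response_old):], end='')
    else:
        print(response_new, end='')
    response_old = response_new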
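
To see why passing the retrieved passage as history has any effect, it helps to look at how history ends up in the prompt. The sketch below is an approximation from memory of what the first-generation ChatGLM-6B remote code does when it builds a prompt; `build_prompt` is a hypothetical standalone helper, and the exact template should be checked against the `modeling_chatglm.py` shipped with your checkpoint.

```python
def build_prompt(query, history):
    """Concatenate earlier (question, answer) rounds and the new question into one prompt string.

    Assumed template, modelled on the first-generation ChatGLM-6B code; verify it
    against the modeling file in your own checkpoint before relying on it.
    """
    if not history:
        return query
    prompt = ""
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n问：{}\n答：{}\n".format(i, old_query, response)
    prompt += "[Round {}]\n问：{}\n答：".format(len(history), query)
    return prompt


# The retrieved passage plays the role of a previous answer to the same question.
example_history = [("example question", "retrieved passage")]
print(build_prompt("example question", example_history))
```

Under this assumption, the retrieved passage appears to the model as something it already said in an earlier round, which is why it tends to answer consistently with it.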


Conclusion

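The document used here can be swapped for other files such as PDF or Word documents. You can also build an index over those files in batch and save it, then load the saved index later and feed the results into ChatGLM. How far you take it is up to you.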
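
As a rough illustration of the "build the index once, load it later" idea, here is a minimal sketch that stores passages and their vectors in a JSON file. Everything in it is an assumption made for the example: the helper names, the JSON layout, and especially `toy_embed`, which is a stand-in and not a real embedding. In practice you would pass something like the `encode` method of the text2vec `SentenceModel` used in the code above.

```python
import json
import os
import tempfile


def build_index(passages, embed):
    """Embed every passage once and keep text and vector together."""
    return [
        {"text": text, "embedding": [float(x) for x in embed(text)]}
        for text in passages
    ]


def save_index(index, path):
    """Write the index to disk as JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False)


def load_index(path):
    """Read a previously saved index back into memory."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def toy_embed(text):
    # Placeholder only: NOT a real embedding, just two numbers derived from the text.
    return [len(text), text.count(" ")]


index = build_index(["first passage", "second"], toy_embed)
index_path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(index, index_path)
restored = load_index(index_path)
print(restored)
```

With the vectors saved, only the query needs to be embedded at question time, which matters once the document collection grows beyond a single small text file.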
