Uni-gram and Bi-gram Language Models

Author: weixin_40725706 | 2024-04-10 07:47:31

uni-gram and bi-gram

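Experiment Overview

Implement uni-gram and bi-gram language models in Python, with smoothing.
Compute the perplexity (PPL) of the sentences in test.txt and compare how the uni-gram and bi-gram models perform.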

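Problems Encountered and Solutions

Problem 1

Problem: when a list or dict is passed to a function as an argument, modifying the parameter inside the function body also modifies the caller's object.
Solution: pass a flat list as list.copy(), and pass a nested (two-level) dict as copy.deepcopy(dict).
For details, see the CSDN blog post "Python中实参随形参改变而改变的问题" by 长命百岁, which was written during this experiment to answer this question.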
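
The behaviour behind Problem 1 can be reproduced with a small sketch. The function names `add_marker` and `bump_counts` and the sample data are hypothetical, made up only for illustration.

```python
import copy

def add_marker(words):
    """Mutates its argument: the caller's list changes too."""
    words.insert(0, 'begin')
    return words

def bump_counts(counts):
    """Mutates the nested dict in place."""
    for first in counts:
        for second in counts[first]:
            counts[first][second] += 1
    return counts

sentence = ['hello', 'world']
add_marker(sentence.copy())            # a shallow copy is enough for a flat list

bigrams = {'a': {'b': 1, 'c': 2}}
bump_counts(copy.deepcopy(bigrams))    # a deep copy protects the inner dicts
snapshot_after_deep = copy.deepcopy(bigrams)

shallow = bigrams.copy()               # a shallow copy still shares the inner dicts
bump_counts(shallow)                   # so this call changes `bigrams` as well
```

After the last call, `bigrams` holds the incremented counts even though only `shallow` was passed in, which is why the nested dict needs `copy.deepcopy`.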

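Problem 2

Problem: the program ran very slowly, especially in the bi-gram stage.
Solution: the original program passed one test sentence at a time into the function and re-smoothed the whole dictionary on every call. Because the dictionary was passed as an argument, every call also needed a copy, and the bi-gram stage needed deepcopy, which is very slow. The fix was to smooth once over the entire test text. The results before and after the change were essentially the same.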
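
Another way to avoid both the copies and the in-place changes is to leave the counts untouched and apply the add-one formula at lookup time. This is only a sketch of that alternative, not the approach taken in this experiment; `smoothed_prob` and its parameters are hypothetical names.

```python
def smoothed_prob(counts, word, total, vocab_size):
    """Add-one estimate (C(w) + 1) / (N + V), computed at lookup time."""
    return (counts.get(word, 0) + 1) / (total + vocab_size)

counts = {'a': 3, 'b': 1}                         # toy training counts
total = sum(counts.values())                      # N = 4
test_words = ['a', 'c']                           # 'c' never occurs in training
vocab_size = len(set(counts) | set(test_words))   # V = 3

probs = [smoothed_prob(counts, w, total, vocab_size) for w in test_words]
```

Here `probs` is 4/7 for `a` and 1/7 for the unseen word `c`, and `counts` is left unchanged, so nothing has to be copied between calls.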

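Problem 3

Problem: the bi-gram probability computation and the smoothing were wrong, which made the perplexity too large.
Solution:
- At first I tried two storage layouts. In both, the sentence probability was computed as P(abc) = P(a)P(b|a)P(c|b).
  - Layout 1: {a: {b: 1, c: 2}}. Smoothing adds one to b and c, giving {a: {b: 2, c: 3}}. The count of a, taken as b + c, then goes from 1 + 2 to 2 + 3, so for a on its own this is no longer add-one smoothing.
  - Layout 2: {a: {a: 3, b: 1, c: 2}}, which stores a's own count as an entry under a. Smoothing adds one to a, b, and c, giving {a: {a: 4, b: 2, c: 3}}, so a = b + c no longer holds.
- The fix is to add begin and end markers to every sentence, turning abc into begin abc end. Then only counts of the form P(a|b) are needed, because P(begin abc end) = P(begin)P(a|begin)P(b|a)P(c|b)P(end|c). Every sentence starts with begin, so P(begin) = 1 and the expression reduces to P(begin abc end) = P(a|begin)P(b|a)P(c|b)P(end|c). Every factor has the form P(a|b), which is easy to smooth and to process.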
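
The padding step can be sketched as follows. The marker strings `'begin'` and `'end'` and the helper `padded_bigrams` are assumptions for illustration; the code later in this article uses the markers `'nsy6666'` and `'nsy6666///'`.

```python
BEGIN, END = 'begin', 'end'  # hypothetical markers, chosen for readability

def padded_bigrams(words):
    """Return the (previous, next) pairs of a sentence after padding it."""
    padded = [BEGIN] + words + [END]
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams(['a', 'b', 'c'])
factors = [f'P({second}|{first})' for first, second in pairs]
```

For the sentence a b c this yields four pairs, one for each factor P(a|begin), P(b|a), P(c|b), P(end|c), so no stand-alone P(a) term has to be stored or smoothed.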

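Experiment Steps

Data Preprocessing

Both train and test are given as raw text, while the language model works with terms (tokens) as its basic unit. We therefore need to extract the terms from the text before building and testing the language models.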

needless_words = ['!', ',', '.', '?', ':', ';', '<', '>']  # common punctuation marks

Reading the training data

Input: a string, the file path
Output: a two-level list; each inner list holds the words extracted from one sentence

import nltk

def read_train_file(file_path):  # returns all words as a list of lists, one inner list per sentence
    res_list = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.strip().split('__eou__')  # split the line into sentences
            words = [nltk.word_tokenize(each_line) for each_line in split_line]  # tokenize; one word list per sentence
            for i in words:
                need_word = [word.lower() for word in i if word not in needless_words]  # drop common punctuation and lowercase every word
                if len(need_word) > 0:
                    res_list.append(need_word)
    return res_list

Reading the test data

Input: a string, the file path
Output: a list of strings, one per sentence (terms not yet extracted)

def read_test_file(file_path):  # returns all sentences as a list of strings, one string per sentence
    res_list = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.strip().split('__eou__')  # split the line into sentences
            for i in split_line:
                if len(i.strip()) > 0:  # skip empty or whitespace-only pieces
                    res_list.append(i)
    return res_list
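
As a small illustration of the sentence-splitting step, the sketch below applies the same `__eou__` separator to one line. The sample line is made up, and no tokenizer is involved.

```python
line = "Hello there . __eou__ How are you ? __eou__\n"  # hypothetical input line

# Split on the end-of-utterance marker and drop the empty trailing piece.
utterances = [u.strip() for u in line.strip().split('__eou__') if u.strip()]
```

The trailing `__eou__` leaves an empty string after the split, which is why both reader functions filter out empty pieces.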

uni-gram

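Term counting (train)

Input: a two-level list, the output of read_train_file
Output: a dict of the form {a: 10}, meaning that a occurs 10 times in the training data
Note: the function also accumulates total_words, the total number of tokens in the training data. It serves as the denominator when word probabilities are computed later.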

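from collections import defaultdict

total_words = 0  # total number of training tokens; updated by uni_gram

def uni_gram(word_list):  # counts word frequencies and returns them as a dict; word_list is a list of word lists
    global total_words
    uni_dict = defaultdict(float)
    for line in word_list:
        for word in line:
            total_words += 1  # count every token
            uni_dict[word] += 1  # count this word
    return uni_dict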

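Add-one smoothing + perplexity computation
- Input: word_dict, a dict, the output of uni_gram. sens, a list of sentence strings, the output of read_test_file
- Output: a list holding the perplexity of every sentence in the test data
- Add-one smoothing:
  - A test word that never occurs in the training data would give the whole sentence a probability of 0. To prevent this, while traversing the test data we add every word that is missing from the dict, with a count of 0.
  - We then add 1 to the count of every word in the dict.
  - Compute the probability of each word: $P(w_i) = \frac{C(w_i) + 1}{N + V}$, where $C(w_i)$ is the original count, $N$ is the total number of tokens in the training data, and $V$ is the number of ones added, that is, the number of distinct words in the dict
  - Compute the probability of a sentence: $P(a\,b\,c\,d) = P(a)\,P(b)\,P(c)\,P(d)$. To reduce numerical error, the product is replaced by a sum of $\log$ terms
  - Compute the perplexity: $\mathrm{PPL} = 2^{-\frac{1}{n}\sum_{i=1}^{n}\log_2 P(w_i)}$, where $n$ is the number of terms in the sentence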
```
from math import log

uni_ppl = []  # per-sentence perplexities are collected here

def ppl_compute(word_dict, sens):  # word_dict: word-count dict; sens: list of untokenized test sentences
    temp = []
    for sen in sens:
        words = nltk.word_tokenize(sen)
        need_words = [word.lower() for word in words if word not in needless_words]  # extract all terms of the sentence
        temp.append(need_words)
        for word in need_words:  # add test words never seen in training, with count 0
            if word not in word_dict:
                word_dict[word] = 0

    denominator = len(word_dict) + total_words  # V (one added per word) + N (original token total)
    for word in word_dict:  # add one to every count, then turn it into a probability
        word_dict[word] = (word_dict[word] + 1) / denominator

    for need_words in temp:
        res_ppl = 0  # sum of log2 probabilities; summing logs avoids the product collapsing to 0
        for word in need_words:
            res_ppl += log(word_dict[word], 2)
        uni_ppl.append(pow(2, -(res_ppl / len(need_words))))
    return uni_ppl
```
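
A quick sanity check of the perplexity formula, as a sketch: if every token has probability 1/8, the perplexity should be exactly 8. The helper name `perplexity` is hypothetical and is not part of the code above.

```python
from math import log2

def perplexity(token_probs):
    """PPL = 2 ** (-(1/n) * sum(log2 p_i)) for one sentence."""
    log_sum = sum(log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))

uniform = perplexity([1 / 8] * 5)    # every token equally likely among 8 choices
mixed = perplexity([1 / 2, 1 / 8])   # inverse geometric mean of the two probabilities
```

`uniform` comes out as 8 and `mixed` as 4, which matches the reading of perplexity as the average number of equally likely choices per token.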

bi-gram

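Term counting (train)

Input: a two-level list, the output of read_train_file
Output: a two-level dict of the form {a: {b: 1, c: 2}}, meaning that b follows a once and c follows a twice
Note: begin and end markers are added to every sentence here. The reason is explained under Problem 3 in the problems section above.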

def bi_gram(word_list):  # counts bi-gram frequencies and returns a two-level dict
    bi_dict = defaultdict(dict)
    for line in word_list:
        words = ['nsy6666'] + line + ['nsy6666///']  # add begin and end markers without modifying the caller's list
        for index in range(len(words) - 1):
            if words[index + 1] not in bi_dict[words[index]]:  # the following word is stored as a sub-entry
                bi_dict[words[index]][words[index + 1]] = 1
            else:
                bi_dict[words[index]][words[index + 1]] += 1
    return bi_dict

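Add-one smoothing + perplexity computation
- Input: bi_word, a dict, the output of bi_gram. sens, a list of sentence strings, the output of read_test_file
- Output: a list holding the perplexity of every sentence in the test data
- Add-one smoothing:
  - A word pair in the test data that never occurs in the training data would give the whole sentence a probability of 0. To prevent this, while traversing the test data we add every pair that is missing from the dict, with a count of 0. We then add 1 to every stored pair count.
  - Compute $P(a \mid b)$, the probability of a given that b has occurred: the smoothed value of bi_word[b][a] divided by the smoothed number of occurrences of b, that is, the sum of all smoothed counts stored under b.
  - Compute the probability of a sentence: $P(a\,b\,c) = P(a \mid \text{begin})\,P(b \mid a)\,P(c \mid b)\,P(\text{end} \mid c)$
  - Compute the perplexity in the same way as for uni-gram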
```
bi_ppl = []  # per-sentence perplexities are collected here

def ppl_compute_bi(bi_word, sens):
    temp = []
    for sen in sens:  # iterate over every sentence
        words = nltk.word_tokenize(sen)
        need_words = [word.lower() for word in words if word not in needless_words]  # extract all terms of the sentence
        need_words.insert(0, 'nsy6666')  # add the begin marker
        need_words.append('nsy6666///')  # add the end marker
        temp.append(need_words)

        for index in range(len(need_words) - 1):  # add bi-grams that occur in test but never in train, with count 0
            if need_words[index + 1] not in bi_word[need_words[index]]:
                bi_word[need_words[index]][need_words[index + 1]] = 0

    for first_word in bi_word:  # add-one smoothing of every stored bi-gram count
        for second_word in bi_word[first_word]:
            bi_word[first_word][second_word] += 1

    for first_word in bi_word:  # normalize into conditional probabilities. Iterate over the dict, not need_words: repeated test words would be divided more than once
        tt = sum(bi_word[first_word].values())  # compute the total before dividing, because dividing changes the values
        for second_word in bi_word[first_word]:
            bi_word[first_word][second_word] /= tt

    for need_words in temp:
        res_ppl = 0  # sum of log2 conditional probabilities
        for index in range(len(need_words) - 1):
            res_ppl += log(bi_word[need_words[index]][need_words[index + 1]], 2)
        bi_ppl.append(pow(2, -(res_ppl / (len(need_words) - 1))))
    return bi_ppl
```
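
To see the bi-gram computation on data small enough to check by hand, here is a simplified sketch. It is not the code above: it adds the unseen pair at lookup time for one pair at a time instead of for the whole test set at once, so its numbers are not comparable with the results in this article. The names `count_bigrams` and `bigram_ppl` and the toy sentences are hypothetical.

```python
from collections import defaultdict
from math import log2

BEGIN, END = 'nsy6666', 'nsy6666///'  # the markers used in the article's code

def count_bigrams(sentences):
    """Count (previous, next) pairs over padded sentences."""
    counts = defaultdict(dict)
    for words in sentences:
        padded = [BEGIN] + words + [END]
        for first, second in zip(padded, padded[1:]):
            counts[first][second] = counts[first].get(second, 0) + 1
    return counts

def bigram_ppl(counts, words):
    """Perplexity of one sentence with add-one smoothing inside each context."""
    padded = [BEGIN] + words + [END]
    log_sum = 0.0
    for first, second in zip(padded, padded[1:]):
        row = dict(counts.get(first, {}))  # copy, so the counts stay untouched
        row.setdefault(second, 0)          # an unseen pair enters with count 0
        prob = (row[second] + 1) / (sum(row.values()) + len(row))
        log_sum += log2(prob)
    return 2 ** (-log_sum / (len(padded) - 1))

train = [['a', 'b'], ['a', 'c']]
counts = count_bigrams(train)
seen = bigram_ppl(counts, ['a', 'b'])    # every pair occurs in training
unseen = bigram_ppl(counts, ['b', 'a'])  # no pair occurs in training
```

For the seen sentence the three factors are 1, 1/2 and 1, so its perplexity is 2^(1/3), while the sentence made only of unseen pairs gets a clearly higher value.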

Results

uni-gram

print(sum(uni_ppl) / len(uni_ppl))  # mean perplexity over all sentences
>>723.2634736604283

bi-gram

print(sum(bi_ppl) / len(bi_ppl))  # mean perplexity over all sentences
>>51.02679427126319

GitHub: code and data repository
