
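[NLP] 8: Chinese Sentence Sentiment Analysis in Practice: Processing Nine Datasets (Hotels, Weibo, Takeout, Online Shopping and More) and Training with SVM and SGD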

Author: 编程革命者 | 2024-01-30 15:18:43


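Sentiment analysis datasets for 8 Chinese domains

I. Processing the sentiment analysis datasets

1. NLPCC 2014 evaluation task test data and answers

The 2nd Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013) released 10,000 Weibo posts, but these duplicate the 2014 data, so the 2014 conference data is used here.

NLPCC 2014 Evaluation Tasks Test Data: 14,000 Weibo posts, 45,421 sentences (available from the conference website).

The Weibo corpus is annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. The script below collapses them into POS/NEG polarity and labels sentences that are not opinionated as NORM.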

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

def emotion_convert(string):
    # Collapse the seven emotion labels into two polarity classes
    dictionary = {
        'like':'POS',
        'disgust':'NEG',
        'happiness':'POS',
        'sadness':'NEG',
        'anger':'NEG',
        'surprise':'POS',
        'fear':'NEG'
    }
    return dictionary.get(string, None)

NLPCC_2014_path = '...your path/NLPCC/evtestdata1/Training data for Emotion Classification.xml'
out_path = '...your path/test/NLPCC_2014.txt'

with open(NLPCC_2014_path, 'r', encoding='utf-8') as file:
    txt = file.read()

soup = BeautifulSoup(txt, 'html.parser')

# Append one "<sentence> <label>" line per sentence
with open(out_path, 'a', encoding='utf-8') as out:
    for tag in soup.find_all('sentence'):
        text = tag.get_text()
        if tag.attrs['opinionated'] == 'N':
            out.write(text + ' NORM\n')
        elif tag.attrs['opinionated'] == 'Y':
            label = emotion_convert(tag.attrs['emotion-1-type'])
            if label is not None:  # skip emotion types outside the mapping
                out.write(text + ' ' + label + '\n')
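To make the emotion-to-polarity mapping concrete, here is a minimal sketch that runs on a tiny inline sample instead of the real file. The sample XML is hypothetical and only mimics the `sentence` attributes the script reads (`opinionated`, `emotion-1-type`); it also swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` parser so that it has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the attributes read by the script above.
SAMPLE = """
<TrainingData>
  <weibo id="1">
    <sentence id="1" opinionated="Y" emotion-1-type="like">今天天气真好</sentence>
    <sentence id="2" opinionated="N">我去了超市</sentence>
    <sentence id="3" opinionated="Y" emotion-1-type="anger">太气人了</sentence>
  </weibo>
</TrainingData>
"""

POLARITY = {
    'like': 'POS', 'happiness': 'POS', 'surprise': 'POS',
    'disgust': 'NEG', 'sadness': 'NEG', 'anger': 'NEG', 'fear': 'NEG',
}


def label_sentences(xml_text):
    """Return (text, label) pairs with label in {'POS', 'NEG', 'NORM'}."""
    pairs = []
    for sent in ET.fromstring(xml_text.strip()).iter('sentence'):
        text = (sent.text or '').strip()
        if sent.get('opinionated') == 'N':
            pairs.append((text, 'NORM'))
        else:
            label = POLARITY.get(sent.get('emotion-1-type'))
            if label is not None:  # ignore unmapped emotion types
                pairs.append((text, label))
    return pairs


labeled = label_sentences(SAMPLE)
for text, label in labeled:
    print(text, label)
```

Each printed line has the same `<sentence> <label>` shape that the script writes to NLPCC_2014.txt.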

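2. Hotel review data: ChnSentiCorp_htl_all

According to the linked source, this set contains more than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative.

For how to iterate over a dataset with pandas, see the linked reference.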

import pandas as pd

def emotion_convert(string):
    dictionary = {
        1:'POS',
        0:'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/ChnSentiCorp_htl_all/'
pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')

print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])

# print(pd_all.sample(2))

# Write the corpus (all rows are written; no class balancing is applied here)

out_path = '...your path/test/ChnSentiCorp_htl_all.txt'

with open(out_path, 'a', encoding='utf-8') as file:
    for row in pd_all.itertuples():
        # print(emotion_convert(getattr(row, 'label')),getattr(row, 'review'))
        try:
            file.write(getattr(row, 'review') + ' ' + emotion_convert(getattr(row, 'label')) + '\n')
        except TypeError:
            # Raised when review or label is not usable as a string (e.g. a missing value)
            print('Error!')

Result:

Number of reviews (total): 7766
Number of reviews (positive): 5322
Number of reviews (negative): 2444
Error!
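The single `Error!` line is most likely caused by a row whose review text is missing: pandas reads an empty cell as the float `NaN`, and concatenating a float with a string raises `TypeError`. This is an assumption, since the post does not show the offending row. A minimal sketch of an explicit guard, using a hypothetical helper `format_row`:

```python
import math


def format_row(review, label, mapping):
    """Return 'review LABEL', or None when the row cannot be used."""
    polarity = mapping.get(label)
    if not isinstance(review, str) or polarity is None:
        return None  # missing text (e.g. NaN) or unmapped label
    # Keep one sample per line by flattening embedded line breaks.
    return review.replace('\n', ' ').strip() + ' ' + polarity


MAPPING = {1: 'POS', 0: 'NEG'}
rows = [('房间很干净', 1), (math.nan, 1), ('服务太差\n不会再来', 0), ('一般', 2)]
lines = [format_row(review, label, MAPPING) for review, label in rows]
print(lines)
```

Rows that yield `None` can then be counted or logged instead of silently printing `Error!`.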

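3. Takeout platform user reviews: waimai_10k

According to the linked source, these are user reviews collected from a takeout platform: 4,000 positive and about 8,000 negative.

The code is the same as above, except for the following lines: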

path = '...your path/情感观点评论 倾向性分析/waimai_10k/'
pd_all = pd.read_csv(path + 'waimai_10k.csv')

out_path = '...your path/test/waimai_10k.txt'

Result:

Number of reviews (total): 11987
Number of reviews (positive): 4000
Number of reviews (negative): 7987

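4. Online shopping review data: online_shopping_10_cats

According to the linked source, this set covers 10 categories with more than 60,000 reviews in total, about 30,000 positive and 30,000 negative. The categories are books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu (dairy), clothes, computers and hotels.

The code is the same as above. Result:

Number of reviews (total): 62774
Number of reviews (positive): 31728
Number of reviews (negative): 31046
Error!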

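5. Sina Weibo sentiment annotations: weibo_senti_100k

According to the linked source, this set has more than 100,000 Sina Weibo posts with sentiment labels, about 50,000 positive and 50,000 negative.

The code is the same as above. Result:

Number of reviews (total): 119988
Number of reviews (positive): 59993
Number of reviews (negative): 59995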

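6. Sina Weibo sentiment annotations: simplifyweibo_4_moods

According to the linked source, this set has more than 360,000 Sina Weibo posts with sentiment labels covering 4 moods: about 200,000 joy, and about 50,000 each of anger, disgust and depression.

The modified parts of the code (label 0, joy, is treated as positive and the other three moods as negative):

def emotion_convert(string):
    dictionary = {
        0: 'POS',
        1: 'NEG',
        2: 'NEG',
        3: 'NEG'
    }
    return dictionary.get(string, None)

print('Number of reviews (positive): %d' % pd_all[pd_all.label==0].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label!=0].shape[0])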

Result:

Number of reviews (total): 361744
Number of reviews (positive): 199496
Number of reviews (negative): 162248

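7. Movie review dataset: dmsc_v2

According to the linked source, this set covers 28 movies, more than 700,000 users and more than 2 million ratings/reviews.

The modified parts of the code are as follows:

def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG'
    }
    return dictionary.get(string, None)

print('Number of reviews (positive): %d' % pd_all[pd_all.rating==5].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.rating==1].shape[0])

for row in pd_all.itertuples():
    try:
        # Keep only the clearest ratings: 5 as positive, 1 as negative
        if getattr(row, 'rating') == 1 or getattr(row, 'rating') == 5:
            file.write(getattr(row, 'comment') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except TypeError:
        # Raised when the comment is not a string (e.g. a missing value)
        print('Error!')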

Result:

Number of reviews (total): 2125056
Number of reviews (positive): 638106
Number of reviews (negative): 190927

8. Restaurant user review data: yf_dianping

According to the linked source, this set covers 240,000 restaurants, 540,000 users and 4.4 million reviews/ratings.

import pandas as pd

def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG',
        0: 'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/yf_dianping/ratings/'
pd_all = pd.read_csv(path + 'ratings.csv')

print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.rating==5].shape[0])
# Count rows rated 0 or 1 with a single mask instead of adding two DataFrames
print('Number of reviews (negative): %d' % pd_all[pd_all.rating.isin([0, 1])].shape[0])

out_path = '...your path/test/yf_dianping.txt'

with open(out_path, 'a', encoding='utf-8') as file:
    for row in pd_all.itertuples():
        try:
            if getattr(row, 'rating') in (0, 1, 5):
                # Flatten line breaks so that each review stays on one line
                file.write((getattr(row, 'comment')).replace('\n',' ') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
        except (TypeError, AttributeError):
            # A missing comment is read as a float NaN, which has no .replace()
            print('Error!')

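9. Product review data: yf_amazon

According to the linked source, this set covers 520,000 products, more than 1,100 categories, 1.42 million users and 7.2 million reviews/ratings.

If a row ends with a comma (presumably meaning its last field is empty), the statement inside try fails and the except branch is triggered.

Result:

Number of reviews (total): 7202921
Number of reviews (positive): 4184629
Number of reviews (negative): 293751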

10. Merging the files

The code is as follows:

Catalog = ['ChnSentiCorp_htl_all', 'dmsc_v2', 'NLPCC_2014', 'online_shopping_10_cats', 'simplifyweibo_4_moods', 'waimai_10k', 'weibo_senti_100k', 'yf_amazon', 'yf_dianping']
path = '...your path/test/'

with open(path + 'Data.txt', 'a', encoding='utf-8') as Data:
    for item in Catalog:
        with open(path + '{}.txt'.format(item), 'r', encoding='utf-8') as file:
            # Trim surrounding newlines and spaces so the files join cleanly
            txt = file.read().strip('\n').strip(' ')
        Data.write(txt + '\n')
        print("Finished merging {}".format(item))

After merging, the combined file is 741 MB in total.
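With files of this size, loading each input completely with `read()` can be avoided. The following is a minimal sketch of a line-by-line merge; the function name `merge_files` and the temporary demo files are illustrative and not part of the original post. It opens the output in `'w'` mode, so rerunning it does not append duplicates.

```python
import os
import tempfile


def merge_files(in_paths, out_path):
    """Concatenate text files line by line, skipping blank lines."""
    written = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for in_path in in_paths:
            with open(in_path, 'r', encoding='utf-8') as src:
                for line in src:  # streams one line at a time
                    line = line.strip()
                    if line:
                        out.write(line + '\n')
                        written += 1
    return written


# Demo on two small temporary files (stand-ins for the per-dataset outputs).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, body in [('a.txt', '很好 POS\n\n'), ('b.txt', '很差 NEG\n一般 NEG')]:
    p = os.path.join(tmp_dir, name)
    with open(p, 'w', encoding='utf-8') as f:
        f.write(body)
    paths.append(p)

merged_path = os.path.join(tmp_dir, 'Data.txt')
count = merge_files(paths, merged_path)
with open(merged_path, encoding='utf-8') as f:
    merged = f.read()
print(count)
print(merged)
```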

II. Vector representation of sentences

1. n_similarity cannot be used directly to compute sentence similarity

Consider the following code:

import numpy as np
from gensim import matutils
from gensim.models import KeyedVectors
# import jieba

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'
# str1list = ' '.join(jieba.cut(str1)).split(' ')
# str2list = ' '.join(jieba.cut(str2)).split(' ')
#
# print(str1list)
# print(str2list)

def get_sentence_vec(list1, list2):
    # Mean of the item vectors; iterating over a str yields single characters
    tmp = np.array([word_vectors[item] for item in list1]).mean(axis=0)
    print(tmp)
    print(np.sum(tmp))
    print(matutils.unitvec(tmp))
    print(np.sum(matutils.unitvec(tmp)))
    print(np.sum((matutils.unitvec(tmp))**2))      # sum of squares of the elements
    tmp2 = np.array([word_vectors[item] for item in list2]).mean(axis=0)
    return np.dot(matutils.unitvec(tmp), matutils.unitvec(tmp2))


list_ = get_sentence_vec(str1, str2)
print(list_)

# Repeat the sum and the mean by hand for comparison
str1sum = np.zeros(word_vectors.vector_size)
cnt1 = 0
for word in str1:
    # print(word)
    cnt1 += 1
    str1sum = str1sum + word_vectors[word]

cnt2 = 0
str2sum = np.zeros(word_vectors.vector_size)
for word in str2:
    cnt2 += 1
    str2sum = str2sum + word_vectors[word]

print('sum', str1sum)
print(sum(str1sum))
print('mean', str1sum/cnt1)
print(sum(str1sum/cnt1))

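Result:

[-2.19413742e-01 ... (300-dimensional vector)]		# tmp
3.1234498
[-2.19141953e-02 ... (300-dimensional vector)]		# L2-normalized
0.3119583
1.0
0.93513715		# the similarity is the same for the plain sum, the mean and the L2-normalized vector
sum [-2.41355096e+00 ... (300-dimensional vector)]
34.3579681138508
mean [-2.19413723e-01 ... (300-dimensional vector)]		# same as tmp
3.1234516467137103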

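Note the problem in the earlier post on generating sentence vectors:

for word in str1		# should be str1list

In addition, matutils.unitvec() acts as an L2-normalization function: it scales the vector so that the sum of squares of its elements is 1. n_similarity reads its input item by item, so when it is given a raw string it consumes one character at a time, which is clearly not equivalent to using the result of jieba word segmentation: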

def n_similarity(self, ws1, ws2):
    """Compute cosine similarity between two sets of words.

    Parameters
    ----------
    ws1 : list of str
        Sequence of words.
    ws2: list of str
        Sequence of words.

    Returns
    -------
    numpy.ndarray
        Similarities between `ws1` and `ws2`.

    """
    if not(len(ws1) and len(ws2)):
        raise ZeroDivisionError('At least one of the passed list is empty.')
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
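Two points from this section can be checked without gensim. The sketch below uses made-up 3-dimensional vectors (not real embeddings) to show that cosine similarity does not change when a vector is rescaled, which is why the sum, the mean and the L2-normalized mean all give the same similarity, and that iterating over a Python `str` yields single characters rather than words.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unitvec(u):
    """Scale u to unit L2 norm (sum of squares equals 1)."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


# Made-up vectors standing in for the tokens of two sentences.
sent1 = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5], [2.0, 0.0, 1.0]]
sent2 = [[1.0, 1.0, 1.0], [0.5, 2.0, 0.0]]

sum1 = [sum(col) for col in zip(*sent1)]
mean1 = [x / len(sent1) for x in sum1]
mean2 = [sum(col) / len(sent2) for col in zip(*sent2)]

sim_sum = cosine(sum1, mean2)
sim_mean = cosine(mean1, mean2)
sim_unit = cosine(unitvec(mean1), unitvec(mean2))
print(sim_sum, sim_mean, sim_unit)

# A str is iterated character by character, so passing it where a list of
# words is expected silently looks up single characters.
chars = [item for item in '花呗更改']
print(chars)
```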

2. Summing the word vectors and taking the average

The code is as follows:

import jieba
from gensim.models import KeyedVectors
from scipy.spatial.distance import cosine

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'


def get_sentence_vec(sentence):
    sentence_list = jieba.lcut(sentence)  # segment the sentence into words
    # A plain list; it becomes a NumPy array after the first addition below
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for word in sentence_list:
        vecsum = vecsum + word_vectors[word]  # KeyError if the word is out of vocabulary
        cnt += 1
    return vecsum/cnt


vec1 = get_sentence_vec(str1)
vec2 = get_sentence_vec(str2)

# scipy's cosine() is a distance, so similarity = 1 - distance
print(cosine(vec1, vec2), 1-cosine(vec1, vec2))

Result:

0.06521224543237958 0.9347877545676204

This differs little from the n_similarity result obtained by averaging over individual characters (0.93513715).

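3. Converting the sentiment datasets and the test dataset to sentence vectors

For picking fixed elements out of a string, see the linked reference.

Running the averaging code on the datasets raises:

TypeError: unsupported operand type(s) for /: 'list' and 'int'

This presumably happens when no word vector was added for a sentence, so vecsum is still the initial Python list. The fix is to use numpy for the initial vector, after which a warning appears instead:

RuntimeWarning: invalid value encountered in true_divide   return vecsum/cnt

Following the linked reference, the cause turned out to be data such as:

Couldn’t be better. POS

For such lines no word vector is found, so the array division becomes 0/0.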
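The 0/0 case can be reproduced with a toy vocabulary. The sketch below is a pure-Python stand-in for the averaging step: the `EMBEDDINGS` dictionary and the pre-tokenized inputs are made up for illustration, whereas the real code looks words up in the gensim `KeyedVectors` after jieba segmentation. It skips out-of-vocabulary tokens and returns `None` when nothing is left to average.

```python
# Made-up 2-dimensional embeddings standing in for the real word vectors.
EMBEDDINGS = {
    '花呗': [1.0, 3.0],
    '绑定': [3.0, 1.0],
    '银行卡': [2.0, 2.0],
}


def mean_vector(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # avoids dividing a zero vector by a zero count
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


chinese = mean_vector(['花呗', '更改', '绑定', '银行卡'], EMBEDDINGS)
english = mean_vector(['Couldn’t', 'be', 'better', '.'], EMBEDDINGS)
print(chinese)
print(english)
```

Returning `None` (or an all-zero vector, as the post does) lets the caller drop such sentences before they reach the classifier.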

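The resulting 'Vec.txt' file was 17 GB, which is clearly too large, so the following three datasets are left out for now:

Movie review dataset dmsc_v2
Restaurant user review data yf_dianping
Product review data yf_amazon

A check for cnt being zero is added; everything else is the same as in the steps above.

The code is as follows: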

import jieba
import numpy as np
from gensim.models import KeyedVectors


def get_sentence_vec(sentence):
    sentence_list = jieba.lcut(sentence)  # segment the sentence into words
    vecsum = np.zeros(word_vectors.vector_size)
    cnt = 0
    for word in sentence_list:
        try:
            vecsum = vecsum + word_vectors[word]
            cnt += 1
        except KeyError:  # word not in the vocabulary
            continue
    if cnt == 0:
        return vecsum  # all zeros: no known word in the sentence
    return vecsum/cnt


word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/Code/test/'

with open(path + 'Data_Small.txt', 'r', encoding='utf-8') as file, \
        open(path + 'Vec_Small.txt', 'a', encoding='utf-8') as output:
    for line in file:
        # The last 3 characters before the newline are taken as the label,
        # so a 4-letter NORM label is stored as 'ORM'
        vec = get_sentence_vec(line[:-4])
        emotion = line[-4:-1]
        if vec.any():  # skip sentences with no known word
            output.write(str(vec).replace('\n','') + ' ' + emotion + '\n')

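The 'Data_Small.txt' file is 106 MB and the generated 'Vec_Small.txt' is 2.38 GB; the program took 29 minutes to run.

The code for converting the test dataset to vectors is the same as above, except for: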

word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/chinese-review-datasets/Chinese review datasets/'

with open(path + 'phone_label.txt', 'r', encoding='utf-8') as file2:
    labels = file2.read().split('\n')

with open(path + 'phone_sentence.txt', 'r', encoding='utf-8') as file1, \
        open(path + 'Vec_test.txt', 'a', encoding='utf-8') as output:
    for i, line in enumerate(file1):
        vec = get_sentence_vec(line.strip('\n'))
        if labels[i] == '1':
            emotion = 'POS'
        elif labels[i] == '0':
            emotion = 'NEG'
        else:
            continue  # skip lines without a 0/1 label
        if vec.any():
            output.write(str(vec).replace('\n','') + ' ' + emotion + '\n')
        else:
            print(i)
            # Sentences for which no known word was found, for example:
            # 1359 屏不比屏差(2231)
            # 20 声噪大(1172)
            #
            # 961 字太大 970 字太大

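4. Reading the sentence vectors and counting them

There is a problem here: the label was read as the last three letters of each line, so NORM was truncated to ORM.

First, count the sentence vectors. The code is as follows: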

path = '...your path/test/Vec_Small.txt'
cnt_POS = 0
cnt_NEG = 0
cnt_NORM = 0
with open(path, 'r', encoding='utf-8') as file:
    for line in file:
        label = line[-4:-1]
        if label == 'POS':
            cnt_POS += 1
        elif label == 'NEG':
            cnt_NEG += 1
        elif label == 'ORM':  # truncated NORM
            cnt_NORM += 1
print('Number of sentence vectors: {}'.format(cnt_POS + cnt_NEG + cnt_NORM))
print('Positive sentence vectors: %s' % cnt_POS)
print('Negative sentence vectors: %s' % cnt_NEG)
print('Neutral sentence vectors: %s' % cnt_NORM)

Result:

Number of sentence vectors: 608133
Positive sentence vectors: 307547
Negative sentence vectors: 270998
Neutral sentence vectors: 29588
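The truncated NORM label comes from slicing a fixed three characters off the end of each line. A minimal sketch of a length-independent alternative, assuming each line has the form `<text> <LABEL>` with a single space before the label (the helper name `split_label` is hypothetical):

```python
def split_label(line):
    """Split '<text> <LABEL>' on the last space; works for any label length."""
    text, _, label = line.rstrip('\n').rpartition(' ')
    return text, label


samples = ['房间 很 干净 POS\n', '服务太差 NEG\n', '我去了超市 NORM\n']
parsed = [split_label(s) for s in samples]
print(parsed)

# Fixed-width slicing, as used in the post, truncates the 4-letter label.
print(samples[2][-4:-1])
```

Because `rpartition` splits on the last space only, spaces inside the sentence text are preserved.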

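The corpus may be imbalanced here. There are two options:

Ignore it for now and continue;
Use only the positive and negative data for now.

The second option is chosen here.

Converting the vector string back to a list raises:

SyntaxError: invalid syntax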

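Cause:

vectors = []
text = '[1 2 3]'      # space-separated, as produced by str() on a NumPy array
vectors.append(eval(text))      # SyntaxError: not a valid Python list literal
print(vectors)

Solution: separate the elements with commas before evaluating:

text = '[1, 2, 3]'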
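Rewriting the separators and calling `eval` works, but `eval` executes whatever the file contains. A minimal sketch of an alternative that parses the space-separated numbers directly, assuming the text is the `str()` form of a one-dimensional NumPy array (the helper name `parse_vector` is hypothetical):

```python
def parse_vector(text):
    """Parse '[ 0.1 -2.0e-01 3. ]' into a list of floats without eval."""
    return [float(tok) for tok in text.strip().strip('[]').split()]


vec = parse_vector('[-2.19413742e-01  3.5e-02 1.   ]')
print(vec)
```

`split()` without arguments collapses runs of spaces, so the irregular padding in NumPy's output needs no regular expression.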

5. Support vector machine (SVM)

The SVM code is as follows:

def load_vectors(path):
    """Read '<vector> <label>' lines; keep POS (1) and NEG (-1), skip the rest."""
    data, labels = [], []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            label = line[-4:-1]
            if label == 'POS':
                labels.append(1)
            elif label == 'NEG':
                labels.append(-1)
            else:  # 'ORM' (truncated NORM) or anything unexpected
                continue
            # Parse the space-separated numbers between the brackets
            data.append([float(x) for x in line[:-5].strip('[] ').split()])
            if len(data) % 10000 == 0:
                print(len(data))
    return data, labels


train_data, train_label = load_vectors('...your path/test/Vec_Small.txt')
print(len(train_data) == len(train_label))
print('Total training sentence vectors: %d' % len(train_data))

test_data, test_label = load_vectors('...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt')
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))


def svm(X_train, y_train, X_test, y_test):  # support vector machine
    from sklearn.svm import SVC  # import the support vector classifier SVC
    svm = SVC()  # defaults: C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, cache_size=200
    svm.fit(X_train, y_train)  # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))  # accuracy on the test set
    predict = svm.predict(X_test)  # predicted labels
    return predict


def cal_accuracy(predict, testing_labels):  # accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    for i in range(len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1
    return correct_classification / len(predict)


predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))

Result:

...
560000
570000
True
Total training sentence vectors: 578545
True
Total test sentence vectors: 6578

Process finished with exit code -1

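Problem: the program ran all night without producing a result. With roughly 580,000 training vectors this is not surprising, because the fit time of the kernel-based SVC grows at least quadratically with the number of samples.

Possible solutions:

Reduce the dataset size further, or reduce the number of decimal places;
Use LinearSVC() instead of SVC.

The second approach is taken, and the code is changed to:

    from sklearn.svm import LinearSVC  # import the linear support vector classifier
    svm = LinearSVC()  # default max_iter=1000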

Result:

E:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
Accuracy of svm on training set:0.74
Accuracy of svm on test set:0.79
0.7902097902097902

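To resolve the warning, set the parameter max_iter=10000.

For saving and reloading the model, see the linked article. That article contains an error (likely an outdated import path for joblib); import it directly: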

import joblib
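A minimal sketch of the full round trip, assuming scikit-learn and joblib are installed. It trains on synthetic data from `make_classification` rather than the post's sentence vectors, and the file name `svm_model.m` is only an example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the sentence vectors (not the post's data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LinearSVC(max_iter=10000)  # raised iteration cap to help convergence
clf.fit(X, y)

# Persist the fitted model and load it back.
model_path = os.path.join(tempfile.mkdtemp(), 'svm_model.m')
joblib.dump(clf, model_path)
restored = joblib.load(model_path)

same = (restored.predict(X) == clf.predict(X)).all()
print(same)
```

The restored object behaves like the original classifier, so it can be used for prediction without retraining.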

6. Stochastic gradient descent (SGD)

See the official scikit-learn SGD documentation for details.

1. Default parameters, max_iter=10000

import joblib
from sklearn.linear_model import SGDClassifier

def SGD(X_train, y_train, X_test, y_test):
    sgd = SGDClassifier(max_iter=10000)
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd,'sgd_model.m')  # save the trained model
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # accuracy on the test set
    predict = sgd.predict(X_test)  # predicted labels
    return predict

Accuracy of sgd on training set:0.74
Accuracy of sgd on test set:0.78

2. Early stopping (validation_fraction=0.1), with scaled data

def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    import joblib
    scaler = StandardScaler()
    scaler.fit(X_train)  # fit the scaler on the training data only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply the same transformation to the test data
    sgd = SGDClassifier(early_stopping=True, max_iter=10000)        # validation_fraction=0.1 by default
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd,'sgd_model_2.m')        # early stopping, scaled data (the scaler itself is not saved)
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # accuracy on the test set
    predict = sgd.predict(X_test)  # predicted labels
    return predict

Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.66
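In the scaled variant only the classifier is written to disk, so the fitted scaler would have to be recreated before the saved model can be applied to new data. One way to keep the two together, shown here as a sketch on synthetic data rather than the post's vectors, is to wrap both steps in a scikit-learn pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sentence vectors (not the post's data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only and applies the
# same scaling automatically whenever predict() or score() is called.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(early_stopping=True, validation_fraction=0.1,
                  max_iter=10000, random_state=0),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print('{:.2f}'.format(test_accuracy))
```

Saving `model` with `joblib.dump` then stores the scaler and the classifier in a single file.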

3. Early stopping (validation_fraction=0.2)

Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.75

4. Early stopping (validation_fraction=0.1), loss='modified_huber'

Accuracy of sgd on training set:0.68
Accuracy of sgd on test set:0.74

5. Early stopping (validation_fraction=0.1), loss='log' (named 'log_loss' in scikit-learn 1.1 and later)

Accuracy of sgd on training set:0.71
Accuracy of sgd on test set:0.78

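The model 'sgd_model_5.m' is used here:

Discarding predictions with confidence below 0.7 gives an accuracy of 0.8445497630331753
Discarding predictions with confidence below 0.75 gives an accuracy of 0.8605553287055941
Discarding predictions with confidence below 0.8 gives an accuracy of 0.8764931259860266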

import joblib


def load_vectors(path):
    """Read '<vector> <label>' lines; keep POS (1) and NEG (-1), skip the rest."""
    data, labels = [], []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            label = line[-4:-1]
            if label == 'POS':
                labels.append(1)
            elif label == 'NEG':
                labels.append(-1)
            else:  # 'ORM' (truncated NORM) or anything unexpected
                continue
            # Parse the space-separated numbers between the brackets
            data.append([float(x) for x in line[:-5].strip('[] ').split()])
    return data, labels


path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
test_data, test_label = load_vectors(path)
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))


def cal_accuracy(predict, testing_labels):  # accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    for i in range(len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1
    return correct_classification / len(predict)


def Show(test_data, predict, testing_labels, threshold=0.8):  # accuracy after discarding low-confidence predictions
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    uncertain_classification = 0  # number of discarded low-confidence samples
    proba = model.predict_proba(test_data)  # requires a probabilistic loss such as the logistic loss
    for i in range(len(predict)):  # for each test sample
        if proba[i][0] < threshold and proba[i][1] < threshold:
            # print('Confidence below threshold: %d' % i)
            uncertain_classification += 1
            continue
        if testing_labels[i] == predict[i]:
            correct_classification += 1
        else:
            print('Misclassified: %d' % i)
    print(uncertain_classification)
    return correct_classification/(len(predict)-uncertain_classification)


model = joblib.load('sgd_model_5.m')

predict = model.predict(test_data)
print(cal_accuracy(predict, test_label))
# print(model.score(test_data, test_label))
# print(model.predict_proba(test_data))
print(Show(test_data, predict, test_label))
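The effect of the confidence threshold can be illustrated without a trained model. The sketch below applies the same rule as `Show` to made-up class probabilities and labels (they are not outputs of `sgd_model_5.m`): predictions whose highest class probability is below the threshold are discarded, and accuracy is computed on the rest. The function name `accuracy_above_threshold` is hypothetical.

```python
def accuracy_above_threshold(probas, predictions, labels, threshold):
    """Accuracy over samples whose top class probability >= threshold.

    Returns (accuracy, number_discarded); accuracy is None if every
    sample is discarded.
    """
    kept = correct = 0
    for proba, pred, label in zip(probas, predictions, labels):
        if max(proba) < threshold:
            continue  # too uncertain, leave it out
        kept += 1
        if pred == label:
            correct += 1
    discarded = len(labels) - kept
    return (correct / kept if kept else None), discarded


# Made-up probabilities for classes (-1, 1), predictions and true labels.
probas = [[0.95, 0.05], [0.40, 0.60], [0.10, 0.90], [0.55, 0.45], [0.15, 0.85]]
predictions = [-1, 1, 1, -1, 1]
labels = [-1, -1, 1, 1, -1]

overall = accuracy_above_threshold(probas, predictions, labels, 0.0)
confident = accuracy_above_threshold(probas, predictions, labels, 0.8)
print(overall)
print(confident)
```

Accuracy on the retained samples rises at the cost of leaving more samples unclassified, which matches the trend in the three figures reported for `sgd_model_5.m`.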

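Summary

Nine sentiment datasets were processed. Because of memory limits, the movie review dataset dmsc_v2, the restaurant user review data yf_dianping and the product review data yf_amazon are not used for now. The following six datasets are used for the subsequent steps: the NLPCC 2014 evaluation task test data and answers, the hotel review data ChnSentiCorp_htl_all, the takeout platform user reviews waimai_10k, the online shopping review data online_shopping_10_cats, and the Sina Weibo sentiment annotations weibo_senti_100k and simplifyweibo_4_moods.
Sentence vectors were generated by summing and averaging word vectors, and SVM (support vector machine) and SGD (stochastic gradient descent) were tested on the sentiment dataset from "Learning multi-grained aspect target sequence for Chinese sentiment analysis", reaching accuracies of 0.79 and 0.78 respectively.

Future work:

Collect more speech recognition and text recognition results and apply the work above in practice
Accuracy of the sentence vector representation:
- Drop the stop-word list and observe the result
- Try doc2vec
- Later on:
  - Weight the word vectors by TF-IDF when building sentence vectors
  - Use a neural network to represent sentence vectors
Accuracy on the sentiment datasets:
- Process other datasets such as the NLPCC 2012 conference data, which comes with evaluation results and is therefore easy to compare against
- Try other training methods
- Later on: train on more sentiment datasets
Result visualization (human-computer interaction interface design)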
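As a pointer for the TF-IDF item in the list above, here is one possible sketch of TF-IDF-weighted averaging. It is not the post's method: the tiny corpus and the 2-dimensional embeddings are made up, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common choice among several.

```python
import math
from collections import Counter

# Made-up tokenized corpus and embeddings, for illustration only.
CORPUS = [['酒店', '很', '干净'], ['酒店', '很', '吵'], ['外卖', '很', '慢']]
EMBEDDINGS = {'酒店': [1.0, 0.0], '很': [0.5, 0.5], '干净': [0.0, 1.0],
              '吵': [0.0, -1.0], '外卖': [-1.0, 0.0], '慢': [-1.0, -1.0]}


def idf_table(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}


def weighted_sentence_vec(tokens, embeddings, idf):
    """TF-IDF-weighted mean of the known tokens' vectors (None if none)."""
    tf = Counter(t for t in tokens if t in embeddings and t in idf)
    total = sum(count * idf[t] for t, count in tf.items())
    if total == 0:
        return None
    dim = len(next(iter(embeddings.values())))
    return [sum(count * idf[t] * embeddings[t][i] for t, count in tf.items()) / total
            for i in range(dim)]


IDF = idf_table(CORPUS)
vec = weighted_sentence_vec(['酒店', '很', '干净'], EMBEDDINGS, IDF)
print(IDF['很'], IDF['干净'])
print(vec)
```

Compared with the plain mean, which would be `[0.5, 0.5]` for this sentence, the weighted vector leans toward the rarer and more informative word 干净.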

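Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】. Copyright belongs to the original author, and this site assumes no corresponding legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/article/detail/47178?site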
