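2nd Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013): 10,000 Weibo posts. This set duplicates the 2014 data, so the 2014 conference data is used instead.
NLPCC 2014 Evaluation Tasks Test Data: 14,000 Weibo posts, 45,421 sentences (see the website).
The Weibo corpus is annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear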
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


def emotion_convert(string):
    # Map the seven emotion tags onto two polarity labels.
    dictionary = {
        'like': 'POS',
        'disgust': 'NEG',
        'happiness': 'POS',
        'sadness': 'NEG',
        'anger': 'NEG',
        'surprise': 'POS',
        'fear': 'NEG'
    }
    return dictionary.get(string, None)


NLPCC_2014_path = '...your path/NLPCC/evtestdata1/Training data for Emotion Classification.xml'
out_path = '...your path/test/NLPCC_2014.txt'

file = open(NLPCC_2014_path, 'r', encoding='utf-8')
txt = file.read()
file.close()

file = open(out_path, 'a', encoding='utf-8')
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.find_all('sentence'):
    file.write(tag.string + ' ')
    if tag.attrs['opinionated'] == 'N':
        file.write('NORM\n')
    elif tag.attrs['opinionated'] == 'Y':
        file.write(emotion_convert(tag.attrs['emotion-1-type']) + '\n')
file.close()
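The loop above can fail on a sentence whose text is missing or whose emotion tag is not in the mapping, because `emotion_convert` then returns `None` and the string concatenation raises. A minimal sketch of a more defensive variant follows, using the standard-library XML parser on a tiny inline sample. The sample document, the `none` emotion value and the helper name `label_sentences` are assumptions for illustration; only the `opinionated` and `emotion-1-type` attributes come from the code above, and the real file may need the more lenient `html.parser` used there.

```python
import xml.etree.ElementTree as ET

POLARITY = {
    'like': 'POS', 'happiness': 'POS', 'surprise': 'POS',
    'disgust': 'NEG', 'sadness': 'NEG', 'anger': 'NEG', 'fear': 'NEG',
}

# Hypothetical sample, not taken from the NLPCC file.
SAMPLE = """<weibo>
  <sentence opinionated="Y" emotion-1-type="like">今天很开心</sentence>
  <sentence opinionated="N">去了一趟超市</sentence>
  <sentence opinionated="Y" emotion-1-type="none">嗯</sentence>
  <sentence opinionated="Y" emotion-1-type="anger"></sentence>
</weibo>"""


def label_sentences(xml_text):
    """Return 'text LABEL' lines, skipping sentences that cannot be labelled."""
    lines = []
    for tag in ET.fromstring(xml_text).iter('sentence'):
        text = (tag.text or '').strip()
        if not text:
            continue  # no sentence text to write
        if tag.get('opinionated') == 'N':
            lines.append(text + ' NORM')
        elif tag.get('opinionated') == 'Y':
            label = POLARITY.get(tag.get('emotion-1-type'))
            if label is not None:  # skip emotion values outside the mapping
                lines.append(text + ' ' + label)
    return lines


print(label_sentences(SAMPLE))
```

Skipping unlabelled sentences silently is a design choice; counting them instead would make it easier to see how much data is dropped.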
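See the linked source: more than 7,000 hotel reviews, of which over 5,000 are positive and over 2,000 are negative.
For how to iterate over a dataset with the pandas library, see the linked reference.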
import pandas as pd


def emotion_convert(string):
    dictionary = {
        1: 'POS',
        0: 'NEG'
    }
    return dictionary.get(string, None)


path = '...your path/情感观点评论 倾向性分析/ChnSentiCorp_htl_all/'
pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')

print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])
# print(pd_all.sample(2))

# Build the balanced corpus
out_path = '...your path/test/ChnSentiCorp_htl_all.txt'
file = open(out_path, 'a', encoding='utf-8')
for row in pd_all.itertuples():
    # print(emotion_convert(getattr(row, 'label')),getattr(row, 'review'))
    try:
        file.write(getattr(row, 'review') + ' ' + emotion_convert(getattr(row, 'label')) + '\n')
    except Exception:
        print('Error!')
file.close()
Result:
Number of reviews (total): 7766
Number of reviews (positive): 5322
Number of reviews (negative): 2444
Error!
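The `Error!` line shows that one row could not be written, but the broad `except` hides why. One plausible cause, which is an assumption here and not confirmed by the source, is a row whose review field is empty: pandas reads an empty field as a float `NaN`, and concatenating a float with a string raises `TypeError`. The sketch below checks the type explicitly and counts skipped rows; `format_row` and the sample rows are hypothetical.

```python
LABELS = {1: 'POS', 0: 'NEG'}


def format_row(review, label):
    """Return the output line, or None when the row cannot be written."""
    if not isinstance(review, str) or label not in LABELS:
        return None  # e.g. an empty CSV field, which pandas reads as float NaN
    return review + ' ' + LABELS[label] + '\n'


# float('nan') stands in for the value pandas yields for an empty field.
rows = [('房间很干净', 1), (float('nan'), 1), ('服务太差', 0)]
skipped = 0
for review, label in rows:
    line = format_row(review, label)
    if line is None:
        skipped += 1
    else:
        print(line, end='')
print('skipped rows:', skipped)
```

Counting the skipped rows makes it easy to compare against the number of `Error!` lines printed by the original loop.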
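See the linked source: user reviews collected from a food-delivery platform, 4,000 positive and about 8,000 negative.
The code is the same as above, with these changes:
path = '...your path/情感观点评论 倾向性分析/waimai_10k/'
pd_all = pd.read_csv(path + 'waimai_10k.csv')
out_path = '...your path/test/waimai_10k.txt'
Result:
Number of reviews (total): 11987
Number of reviews (positive): 4000
Number of reviews (negative): 7987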
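See the linked source: 10 categories and more than 60,000 reviews in total, about 30,000 positive and 30,000 negative, covering books, tablets, phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels.
The code is the same as above. Result:
Number of reviews (total): 62774
Number of reviews (positive): 31728
Number of reviews (negative): 31046
Error!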
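See the linked source: more than 100,000 sentiment-annotated Sina Weibo posts, about 50,000 positive and 50,000 negative.
The code is the same as above. Result:
Number of reviews (total): 119988
Number of reviews (positive): 59993
Number of reviews (negative): 59995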
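See the linked source: more than 360,000 sentiment-annotated Sina Weibo posts covering 4 emotions, of which about 200,000 are joy and about 50,000 each are anger, disgust, and low mood.
Modified parts of the code:
def emotion_convert(string):
    dictionary = {
        0: 'POS',
        1: 'NEG',
        2: 'NEG',
        3: 'NEG'
    }
    return dictionary.get(string, None)

print('Number of reviews (positive): %d' % pd_all[pd_all.label==0].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label!=0].shape[0])
Result:
Number of reviews (total): 361744
Number of reviews (positive): 199496
Number of reviews (negative): 162248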
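See the linked source: 28 movies, over 700,000 users, and over 2 million ratings/reviews. Only 5-star ratings (positive) and 1-star ratings (negative) are kept.
The modified parts of the code are:
def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG'
    }
    return dictionary.get(string, None)

print('Number of reviews (positive): %d' % pd_all[pd_all.rating==5].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.rating==1].shape[0])

for row in pd_all.itertuples():
    try:
        if getattr(row, 'rating') == 1 or getattr(row, 'rating') == 5:
            file.write(getattr(row, 'comment') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except Exception:
        print('Error!')
Result:
Number of reviews (total): 2125056
Number of reviews (positive): 638106
Number of reviews (negative): 190927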
See the linked source: 240,000 restaurants, 540,000 users, and 4.4 million reviews/ratings.
import pandas as pd


def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG',
        0: 'NEG'
    }
    return dictionary.get(string, None)


path = '...your path/情感观点评论 倾向性分析/yf_dianping/ratings/'
pd_all = pd.read_csv(path + 'ratings.csv')

print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.rating==5].shape[0])
# Count ratings of 0 or 1 with a single filter instead of adding two DataFrames.
print('Number of reviews (negative): %d' % pd_all[pd_all.rating.isin([0, 1])].shape[0])

out_path = '...your path/test/yf_dianping.txt'
file = open(out_path, 'a', encoding='utf-8')
for row in pd_all.itertuples():
    # print(emotion_convert(getattr(row, 'label')),getattr(row, 'review'))
    try:
        if getattr(row, 'rating') == 0 or getattr(row, 'rating') == 5 or getattr(row, 'rating') == 1:
            file.write((getattr(row, 'comment')).replace('\n', ' ') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except Exception:
        print('Error!')
file.close()
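See the linked source: 520,000 products, more than 1,100 categories, 1.42 million users, and 7.2 million reviews/ratings.
If a row ends with a comma, the statement inside the try block fails and the except branch is taken.
Result:
Number of reviews (total): 7202921
Number of reviews (positive): 4184629
Number of reviews (negative): 293751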
The code for merging the files is as follows:
Catalog = ['ChnSentiCorp_htl_all', 'dmsc_v2', 'NLPCC_2014', 'online_shopping_10_cats', 'simplifyweibo_4_moods', 'waimai_10k', 'weibo_senti_100k', 'yf_amazon', 'yf_dianping']
path = '...your path/test/'
Data = open(path + 'Data.txt', 'a', encoding='utf-8')
for item in Catalog:
    file = open(path + '{}.txt'.format(item), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
    print("Finished merging file {}".format(item))
Data.close()
After merging, the combined file is 741 MB.
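The merge above reads each file fully into memory and opens the output in append mode, so running it twice would write every dataset twice. A minimal sketch of a streaming alternative follows; the helper name `merge_files` and the throwaway demo files are hypothetical, and dropping blank lines is an added assumption about what the downstream steps expect.

```python
import os
import tempfile


def merge_files(names, src_dir, out_path):
    """Concatenate src_dir/<name>.txt into out_path, skipping blank lines."""
    written = 0
    # 'w' rather than 'a', so that rerunning does not append a second copy.
    with open(out_path, 'w', encoding='utf-8') as out:
        for name in names:
            src = os.path.join(src_dir, name + '.txt')
            with open(src, 'r', encoding='utf-8') as f:
                for line in f:  # streams the file instead of calling read()
                    line = line.rstrip('\n')
                    if line.strip():
                        out.write(line + '\n')
                        written += 1
    return written


# Tiny demonstration with throwaway files (names are hypothetical).
demo_dir = tempfile.mkdtemp()
for name, body in [('a', '好 POS\n\n'), ('b', '差 NEG')]:
    with open(os.path.join(demo_dir, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write(body)
merged = os.path.join(demo_dir, 'Data.txt')
count = merge_files(['a', 'b'], demo_dir, merged)
with open(merged, encoding='utf-8') as f:
    merged_text = f.read()
print(count)
```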
Consider the following code:
from gensim.models import KeyedVectors
# import jieba

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'
# str1list = ' '.join(jieba.cut(str1)).split(' ')
# str2list = ' '.join(jieba.cut(str2)).split(' ')
#
# print(str1list)
# print(str2list)


def get_sentence_vec(list1, list2):
    from numpy import array, dot, sum
    from gensim import matutils
    tmp = []
    for item in list1:
        tmp.append(word_vectors[item])
    tmp = array(tmp).mean(axis=0)
    print(tmp)
    print(sum(tmp))
    print(matutils.unitvec(tmp))
    print(sum(matutils.unitvec(tmp)))
    print(sum((matutils.unitvec(tmp))**2))  # sum of squares of the elements
    tmp2 = []
    for item in list2:
        tmp2.append(word_vectors[item])
    tmp2 = array(tmp2).mean(axis=0)
    return dot(matutils.unitvec(tmp), matutils.unitvec(tmp2))


list_ = get_sentence_vec(str1, str2)
print(list_)

str1sum = [0] * word_vectors.vector_size
cnt1 = 0
for word in str1:
    # print(word)
    cnt1 += 1
    str1sum = str1sum + word_vectors[word]
cnt2 = 0
str2sum = [0] * word_vectors.vector_size
for word in str2:
    cnt2 += 1
    str2sum = str2sum + word_vectors[word]
print('Sum', str1sum)
print(sum(str1sum))
print('Mean', str1sum/cnt1)
print(sum(str1sum/cnt1))
Result:
[-2.19413742e-01...300-dimensional list] # tmp
3.1234498
[-2.19141953e-02...300-dimensional list] # L2-normalized
0.3119583
1.0
0.93513715 # summing, averaging, and L2-normalizing all give the same similarity
Sum [-2.41355096e+00...300-dimensional list]
34.3579681138508
Mean [-2.19413723e-01...300-dimensional list] # same as tmp
3.1234516467137103
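Note the problem in the earlier article on generating sentence vectors:
for word in str1 # should be str1list
Also, matutils.unitvec() is effectively an L2 normalization function: it scales the vector so that the sum of squares of its elements is 1. When n_similarity is given a raw string instead of a list of words, it reads the string one character at a time, which is clearly not equivalent to using the result of jieba word segmentation: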
def n_similarity(self, ws1, ws2):
    """Compute cosine similarity between two sets of words.

    Parameters
    ----------
    ws1 : list of str
        Sequence of words.
    ws2: list of str
        Sequence of words.

    Returns
    -------
    numpy.ndarray
        Similarities between `ws1` and `ws2`.

    """
    if not(len(ws1) and len(ws2)):
        raise ZeroDivisionError('At least one of the passed list is empty.')
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
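Two points from this discussion can be checked without gensim: cosine similarity does not change when a vector is rescaled, so the sum, the mean and the unit-normalized mean all give the same similarity, and iterating over a Python string yields single characters. The sketch below uses made-up 3-dimensional vectors in place of the real 300-dimensional word vectors; `cosine` and `unit` are illustrative helpers, not gensim functions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def unit(v):
    """Scale v to unit L2 norm, so the sum of squares of its elements is 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


# Made-up 3-dimensional "word vectors", for illustration only.
words = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5], [2.0, 0.0, 1.0]]
other = [1.0, 1.0, 1.0]

total = [sum(col) for col in zip(*words)]
mean = [x / len(words) for x in total]

sims = [cosine(total, other), cosine(mean, other), cosine(unit(mean), other)]
print(sims)  # three equal values, up to floating-point rounding

# Iterating over a str yields characters, not words.
print(list('花呗更改'))
```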
The code using jieba segmentation is as follows:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'


def get_sentence_vec(sentence):
    import jieba
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for word in sentence_list:
        vecsum = vecsum + word_vectors[word]
        cnt += 1
    return vecsum/cnt


vec1 = get_sentence_vec(str1)
vec2 = get_sentence_vec(str2)

from scipy.spatial.distance import cosine
print(cosine(vec1, vec2), 1-cosine(vec1, vec2))
Result:
0.06521224543237958 0.9347877545676204
This differs little from the n_similarity result, which sums and averages the vectors of the individual characters.
For selecting fixed elements from a string, see the linked reference.
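Error:
TypeError: unsupported operand type(s) for /: 'list' and 'int'
The fix is to use numpy, after which this warning appears:
RuntimeWarning: invalid value encountered in true_divide
  return vecsum/cnt
Following the linked reference, the cause turned out to be data such as:
Couldn't be better. POS
For such a line no word vector is found, so the array ends up computing 0/0.
The resulting 'Vec.txt' file was 17 GB, which is clearly too large, so these three datasets are left out for now:
A check for cnt being zero is added; the other steps are the same as above.
The code is as follows: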
from gensim.models import KeyedVectors


def get_sentence_vec(sentence):
    import jieba
    import numpy as np
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    # vecsum = [0] * word_vectors.vector_size
    vecsum = np.zeros(word_vectors.vector_size)
    cnt = 0
    for word in sentence_list:
        try:
            vecsum = vecsum + word_vectors[word]
            cnt += 1
        except:
            continue  # skip words that have no vector
    if cnt == 0:
        return vecsum  # all zeros; avoids 0/0
    return vecsum/cnt


word_vectors = KeyedVectors.load('vectors.kv')

path = '...your path/Code/test/'
file = open(path + 'Data_Small.txt', 'r', encoding='utf-8')
output = open(path + 'Vec_Small.txt', 'a', encoding='utf-8')
for line in file.readlines():
    vec = get_sentence_vec(line[:-4])
    emotion = line[-4:-1]
    if vec.any() != 0:
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')
file.close()
output.close()
The 'Data_Small.txt' file is 106 MB and the generated 'Vec_Small.txt' is 2.38 GB; the program ran for 29 minutes.
The code for converting the test dataset to vectors is the same as above, except for:
word_vectors = KeyedVectors.load('vectors.kv')

path = '...your path/chinese-review-datasets/Chinese review datasets/'
file1 = open(path + 'phone_sentence.txt', 'r', encoding='utf-8')
file2 = open(path + 'phone_label.txt', 'r', encoding='utf-8')
labels = file2.read().split('\n')  # renamed from `list` to avoid shadowing the builtin
file2.close()
output = open(path + 'Vec_test.txt', 'a', encoding='utf-8')
i = 0
for line in file1.readlines():
    vec = get_sentence_vec(line.strip('\n'))
    if labels[i] == '1':
        emotion = 'POS'
    elif labels[i] == '0':
        emotion = 'NEG'
    if vec.any() != 0:
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')
    else:
        print(i)
    # 1359 屏不比屏差(2231)
    # 20 声噪大(1172)
    #
    # 961 字太大 970 字太大
    i += 1
file1.close()
output.close()
There is a problem here: the training-data code assumes that every line ends with a three-letter label, so NORM is truncated to ORM and its leading N stays attached to the sentence.
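One way to avoid depending on the label length is to split each line at its last space instead of slicing a fixed number of characters. This is a minimal sketch under the assumption that labels never contain a space; `split_line` is a hypothetical helper and the sample lines are made up.

```python
def split_line(line):
    """Split a 'sentence LABEL' line into (sentence, label) at the last space."""
    sentence, _, label = line.rstrip('\n').rpartition(' ')
    return sentence, label


for raw in ['酒店 很 干净 POS\n', '今天 去 超市 NORM\n']:
    print(split_line(raw))
```

With this split, a later comparison would test for `'NORM'` directly instead of the truncated `'ORM'`.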
First, count the sentence vectors. The code is as follows:
path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
cnt_POS = 0
cnt_NEG = 0
cnt_NORM = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        cnt_POS += 1
    elif line[-4:-1] == 'NEG':
        cnt_NEG += 1
    elif line[-4:-1] == 'ORM':  # truncated 'NORM' label
        cnt_NORM += 1
file.close()
print('Number of sentence vectors: {}'.format(cnt_POS + cnt_NEG + cnt_NORM))
print('Positive sentence vectors: %s' % cnt_POS)
print('Negative sentence vectors: %s' % cnt_NEG)
print('Neutral sentence vectors: %s' % cnt_NORM)
Result:
Number of sentence vectors: 608133
Positive sentence vectors: 307547
Negative sentence vectors: 270998
Neutral sentence vectors: 29588
The corpus may be imbalanced here; there are two possible measures:
The second one is chosen here.
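To put the imbalance in numbers, the short calculation below turns the reported counts into class shares. It uses only the three counts printed above and does not reproduce either of the two measures mentioned.

```python
counts = {'POS': 307547, 'NEG': 270998, 'NORM': 29588}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print('{}: {:.1%}'.format(label, share))
```

The neutral class makes up only about 5% of the vectors, while the positive and negative classes are of similar size.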
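When converting the string to a list, this error appears:
SyntaxError: invalid syntax
Cause (the elements are separated by spaces, which is not valid Python list syntax):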
list = []
str = '[1 2 3]'
list.append(eval(str))
print(list)
Fix (separate the elements with commas):
str = '[1, 2, 3]'
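An alternative that avoids both the comma rewriting and `eval` is to strip the brackets and convert each whitespace-separated token with `float`. This is a minimal sketch; `parse_vector` is a hypothetical helper and it assumes the string holds one flat list of numbers.

```python
def parse_vector(text):
    """Parse a NumPy-style string such as '[1 2 3]' into a list of floats."""
    return [float(tok) for tok in text.strip().strip('[]').split()]


print(parse_vector('[1 2 3]'))
print(parse_vector('[-2.19413742e-01  3.5e-02 1.]'))
```

Because `float` only accepts numbers, a malformed line raises `ValueError` instead of executing arbitrary text.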
The SVM code is as follows:
import re

path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
train_data = []
train_label = []
i = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        train_label.append(1)
    elif line[-4:-1] == 'NEG':
        train_label.append(-1)
    elif line[-4:-1] == 'ORM':  # skip neutral sentences
        continue
    train_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
    i += 1
    if i % 10000 == 0:
        print(i)
file.close()
print(len(train_data) == len(train_label))
print('Total training sentence vectors: %d' % len(train_data))

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []
for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))


def svm(X_train, y_train, X_test, y_test):  # support vector machine
    from sklearn.svm import SVC  # import the support vector classifier SVC
    svm = SVC()  # *, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, cache_size=200
    svm.fit(X_train, y_train)  # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))  # accuracy on the test set
    predict = svm.predict(X_test)  # predicted labels
    return predict  # return the predicted labels


def cal_accuracy(predict, testing_labels):  # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    for i in range(0, len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # count a correct classification
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))  # optionally print the accuracy
    return correct_classification / len(predict)  # return the accuracy


predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))
Result:
……
560000
570000
True
Total training sentence vectors: 578545
True
Total test sentence vectors: 6578
Process finished with exit code -1
Problem: the program ran all night without producing a result.
There are the following possible solutions:
The second approach is used, and the code is changed to:
from sklearn.svm import LinearSVC  # import the linear support vector classifier LinearSVC
svm = LinearSVC()  # max_iter = 1000
Result:
E:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
Accuracy of svm on training set:0.74
Accuracy of svm on test set:0.79
0.7902097902097902
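Fix for the warning: set the parameter max_iter=10000.
For saving and reloading the model, see the linked article. That article contains an error; simply import joblib directly: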
import joblib
1. Default parameters, max_iter=10000
def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    import joblib
    sgd = SGDClassifier(max_iter=10000)
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd, 'sgd_model.m')
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # accuracy on the test set
    predict = sgd.predict(X_test)  # predicted labels
    return predict  # return the predicted labels
Accuracy of sgd on training set:0.74
Accuracy of sgd on test set:0.78
2. Early stopping (validation_fraction=0.1), with scaled data
def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    import joblib
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply the same transformation to the test data
    sgd = SGDClassifier(early_stopping=True, max_iter=10000)  # validation_fraction=0.1
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd, 'sgd_model_2.m')  # early stopping, scaled data
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # accuracy on the training set
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # accuracy on the test set
    predict = sgd.predict(X_test)  # predicted labels
    return predict  # return the predicted labels
Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.66
3. Early stopping (validation_fraction=0.2)
Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.75
4. Early stopping (validation_fraction=0.1), loss='modified_huber'
Accuracy of sgd on training set:0.68
Accuracy of sgd on test set:0.74
5. Early stopping (validation_fraction=0.1), loss='log'
Accuracy of sgd on training set:0.71
Accuracy of sgd on test set:0.78
The model 'sgd_model_5.m' is used here:
import re

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []
for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))


def cal_accuracy(predict, testing_labels):  # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    for i in range(0, len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # count a correct classification
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))  # optionally print the accuracy
    return correct_classification / len(predict)  # return the accuracy


def Show(test_data, predict, testing_labels):  # accuracy over confident predictions only
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    uncertain_classification = 0  # number of low-confidence samples
    proba = model.predict_proba(test_data)
    for i in range(0, len(predict)):  # for each test sample
        if proba[i][0] < 0.8 and proba[i][1] < 0.8:
            # print('Confidence below 0.8: %d' % i)
            uncertain_classification += 1
            continue
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # count a correct classification
        else:
            print('Misclassified: %d' % i)
    print(uncertain_classification)
    return correct_classification/(len(predict)-uncertain_classification)


import joblib
model = joblib.load('sgd_model_5.m')
predict = model.predict(test_data)
print(cal_accuracy(predict, test_label))
# print(model.score(test_data, test_label))
# print(model.predict_proba(test_data))
print(Show(test_data, predict, test_label))
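The confidence filter in `Show` can be tested on its own, without a trained model, by passing the probabilities in as plain lists. The sketch below is one possible standalone version; the name `selective_accuracy`, the returned tuple and the sample numbers are assumptions for illustration, and it also guards against the case where every sample falls below the threshold, which would otherwise divide by zero.

```python
def selective_accuracy(proba, predict, labels, threshold=0.8):
    """Accuracy over predictions whose top class probability reaches threshold.

    Returns (accuracy or None, number of low-confidence samples skipped).
    """
    if not (len(proba) == len(predict) == len(labels)):
        raise ValueError('inputs must have the same length')
    correct = 0
    uncertain = 0
    for p, y_hat, y in zip(proba, predict, labels):
        if max(p) < threshold:
            uncertain += 1  # neither class is confident enough
            continue
        if y_hat == y:
            correct += 1
    kept = len(predict) - uncertain
    return (correct / kept if kept else None), uncertain


# Made-up probabilities for the classes (-1, 1), for illustration only.
proba = [[0.9, 0.1], [0.45, 0.55], [0.15, 0.85], [0.95, 0.05]]
predict = [-1, 1, 1, -1]
labels = [-1, -1, 1, 1]
print(selective_accuracy(proba, predict, labels))
```

Reporting the number of skipped samples next to the accuracy matters, since a high accuracy over very few confident predictions says little about the whole test set.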
Future work: