
## Chinese Sentiment Classification -- COVID-19 Weibo Text

This Chinese sentiment analysis project started as the final assignment for a data mining and analysis course. The task: classify the sentiment of Weibo posts from the COVID-19 period, then analyze how sentiment changed over time.

 1. Datasets: a labeled training set and an unlabeled prediction set. The training set is COVID-related Weibo text with sentiment labels; the prediction set is what the sentiment-trend analysis is run on.
 2. Python libraries: mainly jieba and pandas; see the import statements for the rest.
 3. Main steps: word segmentation, stop-word removal, building a word-vector (Word2Vec) model, vectorizing the segmented texts, model training, and prediction.

[File paths mix \\ and / and were never unified. Some of the code is not very concise; it is meant as a reference for the overall steps. Related files and code (a teammate's Weibo crawler, the cleaning scripts, and dataset links) may be uploaded later if there is interest.]


part 1: training-set text

    -- segmentation, stop-word removal, building the word-vector model (pandas is not used here, which I came to regret)

1. Imports and the main block:

```python
import jieba
import numpy as np
import pandas as pd
import os
import gensim
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
import math
import csv

if __name__ == '__main__':
    data = pd.read_csv('D:\\documents\\data mining\\数据集\\情感分类-疫情微博\\nCoV_100k_train.labled.csv', engine="python")
    #data = pd.read_csv('D:\\documents\\data mining\\数据集\\普通情感分类-7\\情感训练集.csv')
    #print(data.head())
    # extract the text column and the label column
    data1 = list(data.iloc[:, 3])   # adjust per dataset: 100k -> column 3, 情感训练集 -> column 0
    #print(data1[0])
    label = list(data.iloc[:, 6])   # adjust per dataset: 100k -> column 6, 情感训练集 -> column 1
    # word segmentation
    size = 100  # word-vector dimensionality
    (data2, label) = word_cut(data1, label, size)  # returns the segmented texts and their labels
    print('分词成功')
    print(len(data2), len(label))
```
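If you are unsure which columns hold the text and the label in your copy of the dataset, a quick inspection like the following helps (my addition, not part of the original script):

```python
# peek at the column layout before hard-coding the iloc indices above
print(data.columns)
print(data.iloc[0, 3], data.iloc[0, 6])
```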

2. Segmentation, stop-word removal, and the word-vector model:

```python
def word_cut(data1, label, size):
    filelist = []
    for i in data1:
        i = str(i)
        i = i.replace('展开全文c', '')
        s = jieba.cut(i, cut_all=False)
        cutstr = '$$$'.join(s)
        '''
        s1 = iter(s)
        cutstr = ''
        for i in s1:
            if cutstr == '':
                cutstr += i
            else:
                cutstr += '$$$'
                cutstr += i
        '''
        textlist = cutstr.split('$$$')
        #print(textlist)
        filelist.append(textlist)
    filelist = removesw(filelist)  # list after stop-word removal; some entries may now be empty
    j = 0
    for i in range(len(filelist)):  # drop entries that became empty
        if len(filelist[i - j]) == 0:
            del filelist[i - j]
            del label[i - j]
            j += 1
    #print(len(filelist), len(label))
    #print(filelist[0], label[0])
    #print(filelist[-1], label[-1])
    # write "segmented text + label" to a txt file
    txtfile = open('D:/documents/data mining/数据集/代码/data_cut.txt', mode='w')
    for i in range(len(filelist)):
        string = ''
        for j in filelist[i]:
            if j != '':
                if string == '':
                    string += j
                else:
                    string += ','
                    string += j
        txtfile.write(string.encode("gbk", 'ignore').decode("gbk", "ignore") + ' ' + str(label[i]) + '\n')
    txtfile.close()
    print('cut_word写入txt')
    # build the Word2Vec model from the token lists (gensim 3.x API; the argument is vector_size in gensim 4+)
    model = Word2Vec(filelist, size=size, window=5, min_count=1, workers=4)
    model.save("D:/documents/data mining/数据集/代码/word2vec.bin")
    print('cut_word加入词向量模型')
    return (filelist, label)
```

This part uses jieba to segment each post; the tokens are joined with `$$$`, split back into a list, and then passed through the stop-word removal function further below.

The segmented, stop-word-filtered texts are then fed into a Word2Vec model. Note that the `filelist` passed to `Word2Vec` only needs to be an iterable of token lists; the statements for adding more text to the model and for turning texts into vectors are in part 2.
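As a quick illustration of that point, here is a toy sketch (my own example, not part of the project code; it assumes gensim 3.x, where the dimensionality argument is `size` rather than the `vector_size` used in gensim 4+):

```python
from gensim.models import Word2Vec

# any iterable of token lists works as training input
toy_corpus = [['疫情', '加油'], ['医院', '口罩', '防护'], ['疫情', '口罩']]
toy_model = Word2Vec(toy_corpus, size=100, window=5, min_count=1, workers=4)

print(toy_model.wv['疫情'][:5])           # look up a word vector (first 5 dimensions)
print(toy_model.wv.most_similar('口罩'))  # nearest neighbours in the toy vector space
```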

```python
def removesw(filelist):  # filelist: a list of token lists
    stop_words = None
    with open('D:/documents/data mining/数据集/stopwords-master/cn_stopwords.txt', 'r', encoding='utf-8') as f:
        stop_words = f.readlines()
    stop_words = [word.replace('\n', '') for word in stop_words]
    # filter the stop words out of every token list
    for i in range(len(filelist)):
        filelist[i] = [x for x in filelist[i] if x not in stop_words]
    return filelist
```

This part removes stop words. The stop-word txt is a list found online; it was extended and trimmed along the way to fit Weibo language. The list comprehension inside the for loop is the core of this step.
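A hypothetical call, assuming '我', '很' and '的' are in the stop-word list:

```python
tokens = [['我', '今天', '很', '开心'], ['的']]
print(removesw(tokens))   # -> [['今天', '开心'], []]   (the empty inner list is dropped later, in word_cut)
```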

part 2: prediction-set data

  -- This part mainly uses pandas. It segments and de-stops the prediction set, adds the result to the Word2Vec model built in part 1, and then uses that model together with the segmented training and prediction texts to build text vectors, which are written to .csv files.

1. Imports, plus data cleaning, segmentation, and stop-word removal

(The cleaning step strips boilerplate fragments that carry no meaning, so they cannot slip through after segmentation.)

```python
import os
import pandas as pd
import jieba
import gensim
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
import numpy as np
import csv

# ---- data cleaning and segmentation ----
with open('D:/documents/data mining/数据集/stopwords-master/cn_stopwords.txt', 'r', encoding='utf-8') as f:
    stop_words = f.readlines()
stop_words = [word.replace('\n', '') for word in stop_words]
stop_words.append('\u200b')  # zero-width space, common in Weibo text

origin_dir = 'D:\\documents\\data mining\\数据集\\代码\\cleaned_text\\'
files = os.listdir(origin_dir)
after_clean_dir = 'D:\\documents\\data mining\\数据集\\代码\\after_clean\\'

def clean_mix(s):
    # strip Weibo boilerplate phrases
    return s.replace('收起全文d', '').replace('展开全文d', '').replace('的秒拍视频', '').replace('的微博视频', '').replace('的快手视频', '').replace('\n', '').replace('O网页链接', '')

def after_jieba_stopword(s):
    # segment with jieba, drop stop words, join the tokens with spaces
    a = jieba.cut(str(s), cut_all=False)
    b = '$$$'.join(a)
    c = [x for x in b.split('$$$') if x not in stop_words]
    return ' '.join(c)

N_origin = 0
N_filter = 0
for file in files:
    data = pd.read_table(origin_dir + file, sep=',', encoding='utf-8')
    N_origin += len(data)
    # clean, segment, remove stop words
    data['cleaned_text'] = data['cleaned_text'].map(lambda x: clean_mix(str(x)) if type(x) == type('') else '')
    data['cleaned_text'] = data['cleaned_text'].map(lambda x: after_jieba_stopword(x))
    data['removeWellSign'] = data['removeWellSign'].map(lambda x: clean_mix(str(x)) if type(x) == type('') else '')
    data['removeWellSign'] = data['removeWellSign'].map(lambda x: after_jieba_stopword(x))
    data_filter = data.loc[data['cleaned_text'] != '', :]
    data_filter['id'] = np.arange(0, len(data_filter), 1)
    N_filter += len(data_filter)
    data_filter[['id', 'original_text', 'cleaned_text', 'removeWellSign']].to_csv(after_clean_dir + file, sep=',', index=None, encoding='utf-8')
    print(file, 'over')
print(N_origin)
print(N_filter)
```

2. Word2Vec model training

  -- add the segmented prediction-set text to the word-vector model

```python
# continue training: add the prediction-set text to the Word2Vec model
after_clean_dir = 'D:\\documents\\data mining\\数据集\\代码\\after_clean\\'
files = os.listdir(after_clean_dir)
model = Word2Vec.load("D:/documents/data mining/数据集/代码/word2vec.bin")
for file in files:
    data = pd.read_table(after_clean_dir + file, sep=',', encoding='utf-8')
    filelist = list(data['cleaned_text'].map(lambda x: x.split(' ')))
    model.train(filelist, total_examples=model.corpus_count, epochs=model.iter)  # model.iter is the gensim 3.x name; model.epochs in gensim 4+
    print(file, 'train over')
model.save("D:/documents/data mining/数据集/代码/word2vec.bin")
print('预测文本加入词向量模型-成功')
```
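One caveat (my observation, not from the original write-up): in gensim 3.x, calling `model.train` on new sentences only updates vectors for words that are already in the vocabulary; words that first appear in the prediction set are silently skipped. If the goal is to actually add those new words to the model, the vocabulary has to be expanded first, roughly like this:

```python
for file in files:
    data = pd.read_table(after_clean_dir + file, sep=',', encoding='utf-8')
    filelist = list(data['cleaned_text'].map(lambda x: x.split(' ')))
    model.build_vocab(filelist, update=True)                             # add unseen words to the vocabulary
    model.train(filelist, total_examples=len(filelist), epochs=model.iter)
```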

3. Text vectorization

For each segmented text, look up the vector of every word that exists in the Word2Vec vocabulary (words that are missing are skipped), sum the vectors with equal weight, and divide by the number of words found; the average is used as the text (sentence) vector.
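The averaging step described above can be written as a small standalone helper; this is just a sketch of the idea (the function name and signature are mine, not from the original code):

```python
import numpy as np

def text_to_vector(tokens, model, size=100):
    """Average the Word2Vec vectors of the tokens that exist in the vocabulary."""
    vec = np.zeros(size)
    count = 0
    for word in tokens:
        if word in model.wv:        # skip words missing from the vocabulary
            vec += model.wv[word]
            count += 1
    return vec / count if count else vec
```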

```python
# vectorize the ~1.06 million prediction-set texts
after_clean_dir = 'D:\\documents\\data mining\\数据集\\代码\\after_clean\\'
vectors_dir = 'D:\\documents\\data mining\\数据集\\代码\\vectors\\'
files = os.listdir(after_clean_dir)
model = Word2Vec.load("D:/documents/data mining/数据集/代码/word2vec.bin")
for file in files:
    data = pd.read_table(after_clean_dir + file, sep=',', encoding='utf-8')
    filelist = list(data['cleaned_text'].map(lambda x: x.split(' ')))
    df = pd.DataFrame()
    for text in filelist:
        text_vector = np.zeros(100).reshape((1, 100))
        count = 0
        for word in text:
            try:
                text_vector += model[word].reshape((1, 100))  # model[word] is the old gensim access; model.wv[word] in newer versions
                count += 1
            except KeyError:
                continue
        if count != 0:
            text_vector /= count  # average over the count words that were found
        vector_list = list(list(text_vector)[0])
        df = df.append(pd.Series(vector_list), ignore_index=True)
    df.to_csv(vectors_dir + file, sep=',', index=None, header=None)
    print(file, 'train over')

# --- vectorize the training-set texts ---
model = Word2Vec.load("D:/documents/data mining/数据集/代码/word2vec.bin")
txtfile = open('D:\\documents\\data mining\\数据集\\代码\\data_cut.txt', 'r')
data = []
for i in txtfile.readlines():
    a = i.split(' ')
    a = [word.replace('\n', '') for word in a]
    data.append(a)  # [[cut_word, label], [cut_word, label], ...]
for i in data:
    text = i[0].split(',')
    text_vector = np.zeros(100).reshape((1, 100))
    count = 0
    for word in text:
        try:
            text_vector += model[word].reshape((1, 100))
            count += 1
        except KeyError:
            continue
    if count != 0:
        text_vector /= count
    vector_list = list(list(text_vector)[0])
    i.append(vector_list)   # each entry becomes [cut_word, label, vector_list]
print(data[0])
with open('D:\\documents\\data mining\\数据集\\代码\\trainText_vector.csv', 'w', newline='') as tf:
    writer = csv.writer(tf, delimiter=',')
    for row in data:
        row1 = row[2]             # the 100-dimensional text vector
        row1.append(int(row[1]))  # append the label as the last column
        writer.writerow(row1)
print('训练文本向量化完成')
```
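One more note: `DataFrame.append`, used above, was removed in pandas 2.0. On newer pandas, the per-text vectors can be collected in a plain list and turned into a DataFrame in one go (my adaptation, reusing the `text_to_vector` sketch from earlier; it is also much faster than row-by-row append):

```python
rows = []
for text in filelist:
    rows.append(text_to_vector(text, model, size=100))  # averaged word vectors for one text
df = pd.DataFrame(rows)
df.to_csv(vectors_dir + file, sep=',', index=None, header=None)
```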

4. Model training

-- The model is a decision tree wrapped in a OneVsOne classifier, chosen after comparing several candidates. During training, the training-set vectors were split 9:1 into a training and a test subset; accuracy is reasonably good and the model also behaves well when classifying the prediction set.

```python
from sklearn.multiclass import OneVsOneClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from joblib import dump, load

# --- model training and prediction ---
after_clean_dir = 'D:\\documents\\data mining\\数据集\\代码\\after_clean\\'
vectors_dir = 'D:\\documents\\data mining\\数据集\\代码\\vectors\\'
label_dir = 'D:\\documents\\data mining\\数据集\\代码\\text_label\\'
files = os.listdir(after_clean_dir)

# model training
labeled_path = 'D:\\documents\\data mining\\数据集\\代码\\trainText_vector.csv'
labeled = pd.read_table(labeled_path, sep=',')
n = len(labeled)           # 11281
vectors = labeled.iloc[:, :-1]
labels = labeled.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2)
y_test_list = list(y_test)
y_train_list2 = np.array(list(y_train.map(lambda x: [x])))
X_train_list = np.array(X_train)
X_test_list = np.array(X_test)
n_train = len(y_train)     # 10152
n_test = len(y_test)       # 1129

def accuracy(a, b):
    c = []
    for i in range(len(a)):
        if a[i] == b[i]:
            c.append(1)
        else:
            c.append(0)
    return sum(c) / len(c)

model_tree_one = OneVsOneClassifier(DecisionTreeRegressor())  # one-vs-one (pairwise) scheme
model_tree_one.fit(X_train, y_train)
predict_tree_one = model_tree_one.predict(X_test)
print(predict_tree_one)
accuracy_tree_one = accuracy(predict_tree_one, y_test_list)   # 0.7478753541076487
print("accuracy_tree_one:" + str(accuracy_tree_one))
dump(model_tree_one, 'model_tree_one.joblib')
print('预测模型建立并存储完成')
```
5. Sentiment prediction

```python
# prediction
#model_tree_one = load('D:\\documents\\data mining\\数据集\\代码\\model_tree_one.joblib')
model_tree_one = load('D:\\documents\\data mining\\数据集\\代码\\svc.joblib')  # note: this run loads 'svc.joblib' (presumably an SVC trained separately), not the tree model saved above
for file in files:
    vectors_file = pd.read_table(vectors_dir + file, sep=',', header=None)
    text_file = pd.read_table(after_clean_dir + file, sep=',')
    result = model_tree_one.predict(vectors_file)
    text_file['label'] = result
    text_file.to_csv(label_dir + file, sep=',', index=None)
    print(file, 'predict over')
```

6. Writing the classification statistics (positive, negative, totals, etc.) to a .csv file

```python
# statistics over the predicted labels
from pandas import DataFrame

analysis_dir = 'D:\\documents\\data mining\\数据集\\代码\\text_label\\'
analysis_files = os.listdir(analysis_dir)
#analysis_data = {'date':[], 'neg':[], 'pos':[], 'total':[]}
analysis_df = DataFrame(data=[], index=[], columns=['deta', 'neg', 'pos', 'total'])
for file in analysis_files:
    analysis_file = pd.read_table(analysis_dir + file, sep=',')
    vc = analysis_file['label'].value_counts(normalize=False, dropna=False)
    pos = vc[1]    # number of posts predicted positive
    neg = vc[-1]   # number of posts predicted negative
    total = analysis_file['label'].count()
    print(file, neg, pos, total)
    analysis_df = analysis_df.append(
        pd.DataFrame([[file.replace('.csv', '').replace('.', '-'), neg, pos, total]],
                     columns=['deta', 'neg', 'pos', 'total']))
analysis_df.to_csv('D:\\documents\\data mining\\数据集\\代码\\结果图.csv', sep=',', index=None)
```
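If a day's file happens to contain no positive (or no negative) predictions, `vc[1]` / `vc[-1]` will raise a `KeyError`; a defensive variant (my addition) is:

```python
pos = vc.get(1, 0)
neg = vc.get(-1, 0)
```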

 
