Meituan shop reviews: text processing and classification (TF-IDF, SVM, decision tree, random forest, KNN, ensemble)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

df = pd.read_excel("all_data_meituan.xlsx")[["comment", "star"]]
df.head()
   comment                                                                          star
0  还行吧,建议不要排队那个烤鸭和羊肉串,因为烤肉时间本来就不够,排那个要半小时,然后再回来吃烤...    40
1  去过好几次了 东西还是老样子 没增添什么新花样 环境倒是挺不错 离我们这也挺近 味道还可以 ...    40
2  一个字:好!!! #羊肉串# #五花肉# #牛舌# #很好吃# #鸡软骨# #拌菜# #抄河...    50
3  第一次来吃,之前看过好多推荐说这个好吃,真的抱了好大希望,排队的人挺多的,想吃得趁早来啊。还...    20
4  羊肉串真的不太好吃,那种说膻不膻说臭不臭的味。烤鸭还行,大虾没少吃,也就到那吃大虾了,吃完了...    30
df.shape
(17400, 2)
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 30 else 0)  # stars above 30 are labelled 1, the rest 0
df = df.drop_duplicates()  # drop duplicate comments
df = df.dropna()
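The labelling rule can be checked by hand on the five rows printed by `df.head()`. The following is a minimal sketch in plain Python; `to_sentiment` and `sample_stars` are names introduced here for illustration only, and the star values are copied from the output shown earlier.

```python
from collections import Counter

def to_sentiment(star):
    """Same rule as the lambda above: stars above 30 become 1, the rest 0."""
    return 1 if star > 30 else 0

# Star values of the five rows printed by df.head()
sample_stars = [40, 40, 50, 20, 30]
sample_labels = [to_sentiment(s) for s in sample_stars]
label_counts = Counter(sample_labels)

print(sample_labels)  # [1, 1, 1, 0, 0]
print(label_counts)   # Counter({1: 3, 0: 2})
```

Note that a 30-star review falls into class 0 under this rule, so neutral reviews are grouped with the negative ones.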
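# Note: this concatenation repeats every comment three times, so copies of the same
# comment can land in both the training and the test split and inflate the test scores.
X = pd.concat([df[['comment']], df[['comment']], df[['comment']]])
y = pd.concat([df.sentiment, df.sentiment, df.sentiment])
X.columns = ['comment']
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
X.shape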
(3138, 1)
import jieba

def chinese_word_cut(mytext):
    # Segment the text with jieba and join the tokens with spaces
    return " ".join(jieba.cut(mytext))

X['cut_comment'] = X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\FRED-H~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.651 seconds.
Prefix dict has been built succesfully.
0    还行 吧 , 建议 不要 排队 那个 烤鸭 和 羊肉串 , 因为 烤肉 时间 本来 就 不够...
1    去过 好 几次 了 东西 还是 老 样子 没 增添 什么 新花样 环境 倒 是 ...
2    一个 字 : 好 ! ! ! # 羊肉串 # # 五花肉 # # 牛舌 # ...
3    第一次 来 吃 , 之前 看过 好多 推荐 说 这个 好吃 , 真的 抱 了 好 大 希望 ...
4    羊肉串 真的 不太 好吃 , 那种 说 膻 不 膻 说 臭 不 臭 的 味 。 烤鸭 还 行...
Name: cut_comment, dtype: object
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
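Because `X` was built by concatenating the same comments three times, a random split can place one copy of a comment in the training set and another copy in the test set. The sketch below illustrates the effect in plain Python. It is not the `train_test_split` implementation: `split`, `leaked_share` and the made-up `unique_comments` list are assumptions introduced only for this illustration.

```python
import random

def split(items, test_size=0.25, seed=42):
    """Shuffle a copy of items and cut off a test_size share as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def leaked_share(train, test):
    """Fraction of test items whose exact text also occurs in the training set."""
    train_set = set(train)
    return sum(item in train_set for item in test) / len(test)

# Hypothetical stand-in for the deduplicated comments
unique_comments = [f"comment {i}" for i in range(1000)]

# Case 1: triple the data first, then split (the order used above)
train_a, test_a = split(unique_comments * 3)
leak_tripled = leaked_share(train_a, test_a)

# Case 2: split the unique comments directly
train_b, test_b = split(unique_comments)
leak_unique = leaked_share(train_b, test_b)

print(f"tripled then split: {leak_tripled:.1%} of test comments also in train")
print(f"split unique data:  {leak_unique:.1%} of test comments also in train")
```

If extra copies are wanted for training, one option is to split the unique comments first and replicate only the training part, so that the test scores measure performance on unseen comments.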
def get_custom_stopwords(stop_words_file):
    # Read one stop word per line from a UTF-8 text file
    with open(stop_words_file, encoding="utf-8") as f:
        custom_stopwords_list = [i.strip() for i in f.readlines()]
    return custom_stopwords_list

stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)
stopwords[-10:]
['100', '01', '02', '03', '04', '05', '06', '07', '08', '09']
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
vect.fit_transform(X_train["cut_comment"])
<2353x1965 sparse matrix of type '<class 'numpy.int64'>'
	with 20491 stored elements in Compressed Sparse Row format>
vect.fit_transform(X_train["cut_comment"]).toarray().shape
(2353, 1965)
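# pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(), columns=vect.get_feature_names()).iloc[:10, :22]
# print(vect.get_feature_names())
# The feature dimension is 1965, which is not very large (no stop words applied yet)
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b', stop_words=frozenset(stopwords))  # remove stop words
# get_feature_names() matches the scikit-learn version used here; scikit-learn 1.0+ provides get_feature_names_out()
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(), columns=vect.get_feature_names()).head()
# 1691 columns: dropping the columns whose feature is a number removes three columns, leaving 1691
# max_df = 0.8  # drop terms that appear in more than this fraction of documents (too common)
# min_df = 3    # drop terms that appear in fewer than this number of documents (too rare)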
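   amazing  happy  ktv  pm2  一万个  一个多  一个月  一串  一人  一件  ...  麻烦  麻酱  黄喉  黄桃  黄花鱼  黄金  黑乎乎  黑椒  黑胡椒  齐全
0        0      0    0    0      0      0      0     0     0     0  ...     0     0     0     0       0     0       0     0       0     0
1        0      0    0    0      0      0      0     0     0     0  ...     0     0     0     0       0     0       0     0       0     0
2        0      0    0    0      0      0      0     0     0     0  ...     0     0     0     0       0     0       0     0       0     0
3        0      0    0    0      0      0      0     0     0     0  ...     0     0     0     0       0     0       0     0       0     0
4        0      0    0    0      0      0      0     0     0     0  ...     0     0     0     0       0     0       0     0       0     0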

5 rows × 1691 columns
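The custom `token_pattern` differs from the default only in its first character class: `[^\d\W]` accepts a word character that is not a digit, so tokens starting with a digit are skipped. Here is a small sketch with the standard `re` module that applies both patterns to a made-up segmented comment (the sample `text` is an assumption, not taken from the dataset).

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"       # CountVectorizer default
CUSTOM_PATTERN = r"(?u)\b[^\d\W]\w+\b"   # pattern passed to CountVectorizer above

# Hypothetical segmented comment, tokens separated by spaces
text = "pm2 100 羊肉串 好 08 2人"

default_tokens = re.findall(DEFAULT_PATTERN, text)
custom_tokens = re.findall(CUSTOM_PATTERN, text)

print(default_tokens)  # ['pm2', '100', '羊肉串', '08', '2人']
print(custom_tokens)   # ['pm2', '羊肉串']
```

Both patterns require at least two characters, which is why the single-character token `好` is dropped in either case. For Chinese text this may discard meaningful one-character words, so it is worth keeping in mind when reading the feature list.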

from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn import metrics
svc_cl = SVC()
pipe = make_pipeline(vect, svc_cl)
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
0.6318471337579618
metrics.confusion_matrix(y_test,y_pred)
array([[  0, 289],
       [  0, 496]], dtype=int64)
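The first column of this confusion matrix is all zeros, which means the default `SVC` predicted class 1 for every test comment. The sketch below recomputes the scores from the matrix by hand, using the scikit-learn convention that rows are true labels and columns are predicted labels. The variable names are introduced here for illustration.

```python
# Confusion matrix from above: rows are true labels, columns are predicted labels
cm = [[0, 289],
      [0, 496]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions divided by the true count of that class
recall_neg = cm[0][0] / sum(cm[0])
recall_pos = cm[1][1] / sum(cm[1])

# Share of the larger class in the test set, i.e. the score of always predicting it
majority_baseline = max(sum(row) for row in cm) / total

print(accuracy, recall_neg, recall_pos, majority_baseline)
```

The accuracy of about 0.632 is exactly the share of positive comments in the test set, so this model carries no more information than always answering 1.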
Support vector machine classification
from sklearn.svm import SVC
svc_cl = SVC()  # instantiate the classifier
pipe = make_pipeline(vect, svc_cl)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
0.6318471337579618
Support vector machine with grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
# svc = SVC(random_state=1)
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
# ('tfidf', TfidfTransformer()), ('clf', SGDClassifier(max_iter=1000)),
# svc = SGDClassifier(max_iter=1000)
svc = SVC()
# pipe = make_pipeline(vect, SVC)
pipe_svc = Pipeline([("scl", vect), ('tfidf', tfidf), ("clf", svc)])
para_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
para_grid = [
    {'clf__C': para_range,
     'clf__kernel': ['linear']},
    {'clf__gamma': para_range,
     'clf__kernel': ['rbf']}
]
gs=GridSearchCV(estimator=pipe_svc,param_grid=para_grid,cv=10,n_jobs=-1)
gs.fit(X_train.cut_comment,y_train)
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({'...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'clf__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['linear']}, {'clf__gamma': [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000], 'clf__kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
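The `param_grid` shown in this output is a list of two sub-grids, and each sub-grid is expanded on its own rather than crossed with the other. The sketch below enumerates the candidates in plain Python to show how much work the search does. `expand` is a helper written for this illustration and is not the scikit-learn implementation.

```python
from itertools import product

para_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
para_grid = [
    {"clf__C": para_range, "clf__kernel": ["linear"]},
    {"clf__gamma": para_range, "clf__kernel": ["rbf"]},
]

def expand(grid):
    """Yield every parameter combination, one sub-grid at a time."""
    for sub_grid in grid:
        keys = sorted(sub_grid)
        for values in product(*(sub_grid[k] for k in keys)):
            yield dict(zip(keys, values))

candidates = list(expand(para_grid))
cv_folds = 10
n_fits = len(candidates) * cv_folds + 1  # +1 for the final refit (refit=True)

print(len(candidates))  # 16
print(n_fits)           # 161
```

In the `rbf` sub-grid only `gamma` varies, so `C` stays at its default value there. Adding `'clf__C': para_range` to that sub-grid would search both parameters jointly, at the cost of 64 instead of 8 RBF candidates.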
gs.best_estimator_.fit(X_train.cut_comment,y_train)
Pipeline(memory=None,
     steps=[('scl', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({'...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
y_pred = gs.best_estimator_.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
0.9503184713375796
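Besides the tuned parameters, the grid-search pipeline differs from the first SVM run in that it inserts a `TfidfTransformer` between the count vectorizer and the classifier. The sketch below reproduces, in plain Python, what that step does under its default settings as I understand them (smoothed idf followed by L2 normalisation of each row). The count matrix is made up for the example.

```python
import math

# Hypothetical term-count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf_matrix = [tfidf_row(row) for row in counts]

for row in tfidf_matrix:
    print([round(v, 3) for v in row])
```

The term in the first column occurs in every document, so its idf is exactly 1 and it gains no extra weight, while the rarer terms are boosted. The normalisation also puts long and short comments on the same scale, which tends to matter for distance-based and kernel-based models.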
Nearest neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
pipe = make_pipeline(vect, knn)
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
0.7070063694267515
metrics.confusion_matrix(y_test,y_pred)
array([[ 87, 202],
       [ 28, 468]], dtype=int64)
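With `n_neighbors=5`, `p=2` and `weights='uniform'`, the classifier labels a comment by an unweighted majority vote among the five training vectors closest in Euclidean distance. The following is a minimal sketch of that rule in plain Python. `knn_predict` and the two-feature toy vectors are assumptions made for this illustration; the real pipeline works on the 1691-dimensional count vectors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)  # Euclidean distance = Minkowski with p=2
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word-count vectors (2 features) with sentiment labels
train_points = [(0, 3), (1, 3), (0, 2), (3, 0), (3, 1), (2, 0), (4, 1)]
train_labels = [0, 0, 0, 1, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, query=(3, 0.5))
pred_b = knn_predict(train_points, train_labels, query=(0, 2.5))

print(pred_a)  # 1
print(pred_b)  # 0
```

On raw counts the Euclidean distance is dominated by comment length, which may be one reason this model misclassifies 202 of the 289 negative test comments in the confusion matrix above.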
Decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', random_state=1)
pipe = make_pipeline(vect, tree)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
0.9388535031847134
metrics.confusion_matrix(y_test,y_pred)
array([[256,  33],
       [ 15, 481]], dtype=int64)
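The tree above is built with `criterion='entropy'`, meaning each split is chosen to reduce the entropy of the class labels as much as possible. The sketch below computes that reduction (the information gain) for two candidate splits of a toy node. The functions and the label lists are introduced here for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical node with 4 negative and 4 positive comments
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# Split A separates the classes perfectly, split B does not separate them at all
gain_a = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
gain_b = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(gain_a)  # 1.0
print(gain_b)  # 0.0
```

In the review data a split corresponds to a threshold on the count of a single word, so a word that appears almost only in negative comments would yield a high gain.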
Random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(criterion='entropy', random_state=1, n_jobs=2)
pipe = make_pipeline(vect, forest)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)
# Adding TF-IDF actually lowers the accuracy from 96.5% to 95.0%
0.9656050955414013
metrics.confusion_matrix(y_test,y_pred)
array([[265,  24],
       [  3, 493]], dtype=int64)
Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', random_state=1)
# base_estimator matches the scikit-learn version used here; scikit-learn 1.2+ names this parameter estimator
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=10,
                        max_samples=1.0,
                        max_features=1.0,
                        bootstrap=True,
                        bootstrap_features=False,
                        n_jobs=1, random_state=1)
pipe = make_pipeline(vect, tfidf, bag)
pipe.fit(X_train.cut_comment, y_train)
y_pred = pipe.predict(X_test.cut_comment)
metrics.accuracy_score(y_test, y_pred)  # 93.2% without the TF-IDF step; adding it raises the accuracy to 95.5%
0.9554140127388535
metrics.confusion_matrix(y_test,y_pred)
array([[260,  29],
       [  6, 490]], dtype=int64)
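The bagging setup above trains 10 trees, each on a bootstrap sample that has the same size as the training set and is drawn with replacement (`bootstrap=True`, `max_samples=1.0`). The sketch below shows the two ingredients in plain Python. It is a simplification under stated assumptions: the helper names and toy data are made up, and a hard majority vote is used for brevity, whereas scikit-learn's `BaggingClassifier` averages the predicted class probabilities when the base estimator provides them.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (bootstrap=True, max_samples=1.0)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the label predicted by most base models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(1)
data = list(range(20))  # hypothetical training row ids

# One bootstrap sample per base estimator, as with n_estimators=10
samples = [bootstrap_sample(data, rng) for _ in range(10)]
for sample in samples[:3]:
    print(len(sample), "rows drawn,", len(set(sample)), "distinct")

# Hypothetical predictions of the 10 base trees for a single comment
tree_votes = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
ensemble_label = majority_vote(tree_votes)
print(ensemble_label)  # 1
```

Each tree sees a slightly different subset of comments, so their individual errors are partly independent, and combining them tends to reduce variance compared with the single decision tree.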

posted on 2018-09-20 00:04 多一点
