In:
pip install nltk
out:
Requirement already satisfied: nltk in d:\programdata\anaconda3\lib\site-packages (3.4.5)
Requirement already satisfied: six in d:\programdata\anaconda3\lib\site-packages (from nltk) (1.12.0)
Note: you may need to restart the kernel to use updated packages.
In:
import nltk
nltk.download()
out:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
showing info http://nltk.org/nltk_data/
True
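Calling nltk.download() with no arguments opens the interactive downloader shown above. When only a specific resource is needed it can also be fetched non-interactively by name; the resources below are only examples and are not used in the rest of this walkthrough (Chinese segmentation is done with jieba instead):

nltk.download('punkt')      # sentence/word tokenizer models
nltk.download('stopwords')  # English stopword lists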
In:
import pandas as pd

sogou = pd.read_table("../data/sogou.txt", header=None,
                      names=['category', 'title', 'url', 'content'])
stopwords = pd.read_table("../data/stopwords.txt", header=None,
                          names=['stopword'], quoting=3)  # quoting=3 is csv.QUOTE_NONE (ignore quote characters)
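Before segmenting, it is worth a quick check that the columns line up as expected; this inspection step is just a suggestion and was not part of the original output:

sogou.head()                          # category / title / url / content
sogou['category'].value_counts()      # class distribution
stopwords.head()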
In:
# pip install jieba
import jieba

contents = []
for cont in sogou['content']:
    contents.append(jieba.lcut(cont))  # list of token lists, one per article
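jieba.lcut returns an ordinary Python list of tokens, so contents ends up as a list of token lists, one per article. A quick illustration (the segmentation in the comment is typical, but the exact result can vary with jieba's dictionary and version):

jieba.lcut("我爱北京天安门")  # e.g. ['我', '爱', '北京', '天安门']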
In:
# build the stopword set once; calling .tolist() inside the loop would rebuild the list for every article
stopword_set = set(stopwords['stopword'])
contents_new = []
for cont in contents:
    contents_new.append([word for word in cont if word not in stopword_set])
In:
# join each token list back into one space-separated string per article
contents_list = [" ".join(wlist) for wlist in contents_new]
In:
from sklearn.feature_extraction.text import TfidfVectorizer  # CountVectorizer would give raw term counts instead of TF-IDF weights
tfidf = TfidfVectorizer()
In:
content_vec = tfidf.fit_transform(contents_list)
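content_vec is a sparse document-term matrix: one row per article, one column per vocabulary term, holding TF-IDF weights. Its size can be checked directly (purely an inspection sketch):

content_vec.shape        # (n_documents, n_features)
len(tfidf.vocabulary_)   # number of distinct terms kept by the vectorizer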
In:
# split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(content_vec, sogou['category'])
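With no extra arguments, train_test_split shuffles the data and holds out 25% of it for testing, so the accuracy below will vary between runs. Passing explicit parameters makes the split reproducible; the values here are just an example, not the settings used above:

x_train, x_test, y_train, y_test = train_test_split(
    content_vec, sogou['category'], test_size=0.25, random_state=42)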
In:
from sklearn.linear_model import LogisticRegression
logic = LogisticRegression()
logic.fit(x_train, y_train)
logic.score(x_test, y_test)  # 0.8208
out:
0.8208
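To classify a fresh piece of text, the same preprocessing must be applied before predict: segment with jieba, drop stopwords, join the tokens with spaces, and transform with the already-fitted vectorizer (not a new fit). A minimal sketch, with a made-up sample sentence:

new_text = "某公司发布了最新的财报,股价大涨"
tokens = [w for w in jieba.lcut(new_text) if w not in stopword_set]
new_vec = tfidf.transform([" ".join(tokens)])   # reuse the fitted TfidfVectorizer
logic.predict(new_vec)                          # predicted category label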