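There are two ways to compute TF-IDF weights for terms:
(1) The first is to vectorize the text with the CountVectorizer class and then apply the TfidfTransformer class to the resulting counts.
(2) The second is to use TfidfVectorizer directly, which performs vectorization and TF-IDF weighting in a single step.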
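The two approaches can be compared side by side. The following is a minimal sketch, assuming default parameters for all three classes; the `docs` list is a made-up example and is not taken from the original post. With default settings the two routes should produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample documents, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Method 1: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Method 2: vectorization and TF-IDF weighting in one step.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two results element by element.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print("Same result:", same)
```

Method 1 is worth considering when the raw count matrix is needed for something else as well; otherwise method 2 is shorter.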
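from sklearn.feature_extraction.text import TfidfVectorizer  # feature extraction

text = ["You may be out of my sight, but never out of my mind.",
        "Love is not a maybe thing. You know when you love someone.",
        "Life is a journey, not the destination, but the scenery along the should be and the mood at the view."]

# 1. Create the vectorizer
vectorizer = TfidfVectorizer()

# 2. Tokenize the text and build the vocabulary
vectorizer.fit(text)

# 3. The features and the IDF of each feature (term)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print('Features:', vectorizer.get_feature_names_out())
print('IDF of each feature:', vectorizer.idf_)

# 4. Encode the documents
vector = vectorizer.transform([text[0]])  # the first document only
X = vectorizer.fit_transform(text)        # all documents
print("TF-IDF matrix:", X.toarray())

# Summarize the encoded document
print(vector.shape)
print('Matrix:', vector.toarray())

# sklearn applies smoothing when it computes IDF:
# IDF(t) = ln((1 + number of training documents) / (1 + number of documents containing t)) + 1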
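The smoothed IDF formula quoted in the code comment above can be checked by hand. This is a rough sketch under the assumption that `smooth_idf=True` (the default) is in effect; the three short documents and the names `n_docs`, `df` and `manual_idf` are hypothetical and chosen only for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical sample documents, used only for illustration.
docs = ["the cat sat", "the dog sat", "the bird flew"]

# IDF values as computed by scikit-learn (smooth_idf=True by default).
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Recompute the same values by hand from the raw counts.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]            # number of training documents
df = (counts > 0).sum(axis=0)       # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

print("Manual IDF: ", manual_idf)
print("sklearn IDF:", vectorizer.idf_)
print("Match:", np.allclose(manual_idf, vectorizer.idf_))
```

A term that occurs in every document, such as "the" here, gets ln(4/4) + 1 = 1, so the trailing + 1 keeps such terms from being zeroed out entirely.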