Simple Data Preprocessing with TF-IDF

Author: 从前慢现在也慢 | 2024-06-15 18:32:07


TF-IDF preprocessing

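There are two ways to compute TF-IDF weights for terms:
(1) Vectorize the text with the CountVectorizer class first, then apply the TfidfTransformer class to the resulting counts.
(2) Use TfidfVectorizer directly, which performs vectorization and TF-IDF weighting in a single step.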
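As a minimal sketch of how the two approaches relate, the snippet below runs both on a small made-up corpus (the `docs` list is an assumption for illustration, not data from this post) and checks that they produce the same matrix when all parameters are left at their defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Method 1: raw term counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Method 2: vectorization and TF-IDF weighting in one class
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters the two matrices should agree
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print("Same result:", same)
```

The two-step form is useful when the raw counts are needed elsewhere as well; otherwise TfidfVectorizer is the shorter route.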

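from sklearn.feature_extraction.text import TfidfVectorizer  # feature extraction

# Three documents. Each string must be followed by a comma; otherwise
# Python silently concatenates adjacent string literals into one document.
text = ["You may be out of my sight, but never out of my mind.",
        "Love is not a maybe thing. You know when you love someone.",
        "Life is a journey, not the destination, but the scenery along the should be and the mood at the view."]

# 1. Create the vectorizer
vectorizer = TfidfVectorizer()
# 2. Tokenize the text and build the vocabulary
vectorizer.fit(text)
# 3. Features and the IDF of each feature (word)
# get_feature_names_out() requires scikit-learn >= 1.0; it replaces get_feature_names()
print('Features:', vectorizer.get_feature_names_out())

print('IDF of each feature:', vectorizer.idf_)
# 4. Encode a document
vector = vectorizer.transform([text[0]])

# fit and transform the whole corpus in one call
X = vectorizer.fit_transform(text)
print("TF-IDF matrix:", X.toarray())

# Summarize the encoded document
print(vector.shape)
print('Matrix:', vector.toarray())
# The IDF is computed as
# IDF(t) = log((1 + total number of training documents) / (1 + number of documents containing term t)) + 1
# sklearn applies this smoothing by default (smooth_idf=True)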
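The smoothed formula IDF(t) = log((1 + n) / (1 + df(t))) + 1 can be checked by hand. The sketch below recomputes it with NumPy and compares the result with `vectorizer.idf_`; the `docs` corpus and the variable names are assumptions made for illustration, and the default `smooth_idf=True` is assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = TfidfVectorizer()  # smooth_idf=True by default
X = vectorizer.fit_transform(docs)

# df(t): number of documents in which each term occurs.
# A TF-IDF entry is non-zero exactly when the term occurs in that document.
df = (X.toarray() > 0).sum(axis=0)
n_docs = len(docs)

# Smoothed IDF, using the natural logarithm
manual_idf = np.log((1 + n_docs) / (1 + df)) + 1

idf_matches = np.allclose(manual_idf, vectorizer.idf_)
print("Manual IDF matches vectorizer.idf_:", idf_matches)
```

The values in the TF-IDF matrix itself will not equal tf × idf directly, because TfidfVectorizer also L2-normalizes each row by default (`norm='l2'`).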

This article was contributed by a community member. Please credit the source when reposting: 【wpsshop博客】