python – Removing words that occur only once from a TF-IDF vocabulary

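I am trying to remove words that occur only once in my vocabulary in order to reduce its size. I am using sklearn's TfidfVectorizer() and then calling fit_transform on a column of my data frame.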

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))  # cast the column to unicode strings

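My first thought was to use the preprocessor argument of the TF-IDF vectorizer, or to run a separate preprocessing package before the machine-learning step.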

Any tips or links on how to implement this?

Best answer
You are looking for the min_df parameter (minimum document frequency). From the scikit-learn TfidfVectorizer documentation:

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also
called cut-off in the literature. If float, the parameter represents a
proportion of documents, integer absolute counts. This parameter is
ignored if vocabulary is not None.

# ignore words that appear in fewer than 5 documents
# (min_df=2 would drop only the words that appear in a single document)
tfidf = TfidfVectorizer(min_df=5)
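
As a rough illustration of how min_df behaves, here is a minimal sketch on a made-up three-sentence corpus (the sentences are hypothetical and not from the question). Note that min_df counts documents, not total occurrences, so a word repeated several times inside a single document is still dropped by min_df=2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "rain rain go away",
]

# Keep only terms that appear in at least 2 documents
tfidf = TfidfVectorizer(min_df=2)
tfs = tfidf.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(tfidf.vocabulary_)
print(vocab)       # ['on', 'sat', 'the']
print(tfs.shape)   # (3, 3)
```

"rain" occurs twice, but only in one document, so it is removed along with the other single-document words. The third document ends up with no remaining terms, which is something to watch for when the threshold is aggressive.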

You can also remove very common words:

# remove words occurring in more than half the documents
tfidf = TfidfVectorizer(max_df=0.5)
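
A small sketch of the max_df cut-off, again on a hypothetical corpus. With four documents, max_df=0.5 corresponds to a limit of two documents, so any term found in three or more of them is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat",
    "the dog ran",
    "the bird flew",
    "one fish swam",
]

# Ignore terms that appear in more than 50% of the documents
tfidf = TfidfVectorizer(max_df=0.5)
tfidf.fit(docs)

vocab = sorted(tfidf.vocabulary_)
print(vocab)  # "the" (3 of 4 documents) is no longer in the vocabulary
```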

You can also remove stop words like this:

tfidf = TfidfVectorizer(stop_words='english')
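
For completeness, a short sketch of the built-in English stop-word list in action (the two sentences are made up). The stop-word filter is independent of min_df and max_df, so the three options can be passed together if needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Drop words from scikit-learn's built-in English stop-word list
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(docs)

vocab = sorted(tfidf.vocabulary_)
print(vocab)  # function words such as "the" and "on" are filtered out
```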
