NLP - Speed up word similarity matching

I am trying to find the maximum similarity between two words in a pandas DataFrame. This is my routine:

import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1':['desk', 'lamp', 'read'], 'word_2':['call','game','cook']})

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']

    ret_val = max([(wordnet.wup_similarity(syn_1, syn_2) or 0)
                   for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1),
                                                         wordnet.synsets(word_2))])

    return ret_val

df['result'] = df.apply(max_similarity, axis=1)

It works correctly, but it is far too slow, and I am looking for a way to speed it up. WordNet accounts for most of the time. Any suggestions? Cython? I am open to using other packages such as spaCy.

Best answer
Since you said spaCy is an option as the NLP library, let's set up a simple benchmark. We'll take the Brown news corpus and split it in half to create some arbitrary word pairs.

from nltk.corpus import brown

brown_corpus = list(brown.words(categories='news'))
brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],
    'word_2': brown_corpus[len(brown_corpus)//2:]
})

len(brown_df)
50277

The cosine similarity between two tokens/documents can be computed with the Doc.similarity method:

import spacy
nlp = spacy.load('en')

def spacy_max_similarity(row):
    word_1 = nlp(row['word_1'])
    word_2 = nlp(row['word_2'])

    return word_1.similarity(word_2)

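The timings below call an nltk_max_similarity function that is not shown in the question. Presumably it is just the max_similarity routine from above, adapted so that word pairs without any WordNet synsets (which do occur in the Brown corpus) do not raise an error. A minimal sketch of such an adaptation; the default=0 guard is my assumption:

def nltk_max_similarity(row):
    # wordnet and itertools are already imported above
    synsets_1 = wordnet.synsets(row['word_1'])
    synsets_2 = wordnet.synsets(row['word_2'])

    # default=0 covers words that have no synsets at all
    return max((wordnet.wup_similarity(s1, s2) or 0
                for s1, s2 in itertools.product(synsets_1, synsets_2)),
               default=0)
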
Finally, apply both methods to the DataFrame:

nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
1 loop, best of 3: 59 s per loop

spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
1 loop, best of 3: 8.88 s per loop

Note that NLTK and spaCy use different techniques for measuring similarity. spaCy relies on pre-trained word vectors rather than WordNet. From the docs:

Using word vectors and semantic similarities

[…]

The default English model installs vectors for one million vocabulary
entries, using the 300-dimensional vectors trained on the Common Crawl
corpus using the GloVe algorithm. The GloVe common crawl vectors have
become a de facto standard for practical NLP.
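
To see whether a given token actually has a vector behind it (out-of-vocabulary words fall back to a zero vector and make similarity scores unreliable), you can inspect token.has_vector. A small check, assuming a model with word vectors is installed, e.g. the 'en' model loaded above:

doc = nlp('desk xyzzy')
for token in doc:
    # has_vector is False for out-of-vocabulary words
    print(token.text, token.has_vector, token.vector.shape)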

(Chart: nltk word similarity vs. spacy)
