Python - All possible wordform completions of a (biomedical) word's stem
I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".
If I had to do it right now, I would probably go with something like:
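    library(tm)
    library(RWeka)
    data("crude")  # sample corpus shipped with tm
    # Collect every distinct word in the corpus
    dictionary <- unique(unlist(lapply(crude, words)))
    # Keep the words that contain the Lovins stem of the input
    grep(pattern = LovinsStemmer("company"), ignore.case = TRUE,
         x = dictionary, value = TRUE)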
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
This solution requires preprocessing your corpus, but once that is done, each query is a very quick dictionary lookup.
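    from collections import defaultdict
    from stemming.porter2 import stem

    # Read the word list, one word per line
    with open('/usr/share/dict/words') as f:
        words = f.read().splitlines()

    # Preprocessing: group every word under its stem
    stems = defaultdict(list)
    for word in words:
        word_stem = stem(word)
        stems[word_stem].append(word)

    if __name__ == '__main__':
        # Lookup: stem the query and fetch all words sharing that stem
        word = 'leukocyte'
        word_stem = stem(word)
        print(stems[word_stem])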
For the /usr/share/dict/words corpus, this produces the result:
['leukocyte', "leukocyte's", 'leukocytes']
It uses the stemming module, which can be installed with:
pip install stemming
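If the variants should come from your own corpus rather than a system word list, the same stem-to-words index can be built from raw text. What follows is only a minimal sketch under some assumptions: `naive_stem`, `build_stem_index`, and `lookup` are hypothetical names, not part of any library, and `naive_stem` is a deliberately crude suffix stripper included so the example runs without third-party packages. In practice you would pass a real stemmer such as `stemming.porter2.stem` as `stem_fn`, and which forms end up grouped together depends entirely on that stemmer.

```python
import re
from collections import defaultdict


def naive_stem(word):
    """Hypothetical stand-in stemmer (an assumption, not a real algorithm):
    lowercases, drops a possessive 's, then strips one common suffix."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in ("ic", "es", "s", "e"):
        # Only strip if at least four characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_stem_index(text, stem_fn=naive_stem):
    """Preprocessing step: map each stem to the sorted distinct forms in `text`."""
    index = defaultdict(set)
    # Tokens are runs of letters, optionally followed by an apostrophe part
    for token in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        token = token.lower()
        index[stem_fn(token)].add(token)
    return {stem: sorted(forms) for stem, forms in index.items()}


def lookup(word, index, stem_fn=naive_stem):
    """Query step: a single dictionary lookup on the stem of `word`."""
    return index.get(stem_fn(word), [])


SAMPLE_TEXT = (
    "Leukocytes were counted. Each leukocyte's nucleus was stained; "
    "leukocytic infiltration and leukemia were noted."
)
sample_index = build_stem_index(SAMPLE_TEXT)
print(lookup("leukocyte", sample_index))
```

With the toy stemmer and sample text above, this prints `["leukocyte's", 'leukocytes', 'leukocytic']`. Only forms that actually occur in the text are returned, so the query word itself is missing here because it never appears on its own in the sample.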