Python - all possible wordform completions of a (biomedical) word's stem


I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably go with something like:

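    library(tm)
    library(RWeka)
    data("crude")  # sample corpus shipped with tm
    # Collect every distinct word in the corpus
    dictionary <- unique(unlist(lapply(crude, words)))
    # Keep the words that contain the Lovins stem of the input
    grep(pattern = LovinsStemmer("company"),
         ignore.case = TRUE, x = dictionary, value = TRUE)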

I used Lovins because Snowball's Porter stemmer doesn't seem to be aggressive enough.
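The matching step of that snippet (search the dictionary for every word containing the stem, ignoring case) can be sketched in Python as follows. This is only a rough illustration: `find_variants` and the mini-dictionary are made-up names and data, and the stem is written by hand rather than produced by a Lovins stemmer. Unlike R's `grep`, the sketch escapes the stem so that it is matched literally.

```python
import re

def find_variants(stem, dictionary):
    """Return the dictionary words that contain `stem`, ignoring case."""
    # re.escape makes the stem match literally instead of as a regex.
    pattern = re.compile(re.escape(stem), re.IGNORECASE)
    return [word for word in dictionary if pattern.search(word)]

# Hypothetical mini-dictionary standing in for the corpus vocabulary.
dictionary = ["Leukocyte", "leukocytes", "leukocytic", "lymphocyte", "company"]

# "leukocyt" is a hand-written stem for illustration, not real stemmer output.
variants = find_variants("leukocyt", dictionary)
print(variants)
```

Because this scans the whole dictionary on every query, it is only practical for small vocabularies or occasional lookups.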

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.

This solution requires preprocessing your corpus. Once that is done, finding the variants is a quick dictionary lookup.

    from collections import defaultdict
    from stemming.porter2 import stem

    # Read the corpus vocabulary, one word per line
    with open('/usr/share/dict/words') as f:
        words = f.read().splitlines()

    # Map each stem to the list of words that share it
    stems = defaultdict(list)

    for word in words:
        word_stem = stem(word)
        stems[word_stem].append(word)

    if __name__ == '__main__':
        word = 'leukocyte'
        word_stem = stem(word)
        print(stems[word_stem])

For the /usr/share/dict/words corpus, this produces the result

['leukocyte', "leukocyte's", 'leukocytes'] 

It uses the stemming module, which can be installed with

pip install stemming 
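The preprocess-then-lookup idea does not depend on which stemmer is used. Below is a minimal sketch that takes the stemming function as a parameter, so a more aggressive stemmer can be swapped in. Everything in it is an assumption made for illustration: `crude_stem`, its suffix list, `build_stem_index`, `lookup_variants` and the tiny word list are hypothetical, and `crude_stem` is a toy suffix stripper, not the Porter2 or Lovins algorithm.

```python
from collections import defaultdict

# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ies", "es", "ic", "s", "e", "y")

def crude_stem(word):
    """Toy stemmer: drop a possessive, then strip the first matching suffix."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in SUFFIXES:
        # Leave at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_stem_index(words, stem_fn):
    """Preprocessing step: group the words by the stem that stem_fn assigns."""
    index = defaultdict(list)
    for word in words:
        index[stem_fn(word)].append(word)
    return index

def lookup_variants(word, index, stem_fn):
    """Lookup step: return every indexed word sharing the stem of `word`."""
    return index.get(stem_fn(word), [])

# Hypothetical mini-corpus standing in for a real word list.
corpus_words = ["leukocyte", "leukocyte's", "leukocytes", "leukocytic", "lymphocyte"]
index = build_stem_index(corpus_words, crude_stem)
print(lookup_variants("leukocyte", index, crude_stem))
```

A stemmer that strips more suffixes groups more variants together, but it will also merge unrelated words more often, so the resulting lists are worth spot-checking against the corpus.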
