python - all possible wordform completions of a (biomedical) word's stem -

- May 15, 2010

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably go with something like:

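    library(tm)
    library(RWeka)

    # "crude" is the sample Reuters corpus that ships with tm
    data("crude")

    # Collect every distinct word in the corpus
    dictionary <- unique(unlist(lapply(crude, words)))

    # Return all dictionary words that contain the Lovins stem
    grep(pattern = LovinsStemmer("company"),
         ignore.case = TRUE, x = dictionary, value = TRUE)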

I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.

This solution requires preprocessing your corpus. Once that is done, finding the variants is a quick dictionary lookup.

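    from collections import defaultdict
    from stemming.porter2 import stem

    # Load the corpus vocabulary, one word per line
    with open('/usr/share/dict/words') as f:
        words = f.read().splitlines()

    # Preprocessing: map each stem to the list of words that share it
    stems = defaultdict(list)

    for word in words:
        word_stem = stem(word)
        stems[word_stem].append(word)

    if __name__ == '__main__':
        # Lookup: stem the input and fetch every word with the same stem
        word = 'leukocyte'
        word_stem = stem(word)
        print(stems[word_stem])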

For the /usr/share/dict/words corpus, this produces the result

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with

pip install stemming
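The lookup above only returns words whose stem is exactly equal to the stem of the input. If you also want looser matches, closer to the grep in the question, one option is to treat the stem as a prefix and return every vocabulary word that starts with it. Below is a minimal sketch of that idea using only the standard library. The word list, the suffix list, and the helper names `crude_stem` and `variants` are all made up for illustration; `crude_stem` is a naive suffix stripper, not Porter or Lovins, and the vocabulary is assumed to be lowercase and sorted.

```python
from bisect import bisect_left

# Hypothetical stand-in for a corpus vocabulary (lowercase, sorted).
VOCABULARY = sorted([
    "leukemia", "leukocyte", "leukocytes", "leukocytic",
    "leukocytosis", "leukopenia", "lymphocyte",
])

# Illustrative suffix list only; this is not a real stemming algorithm.
SUFFIXES = ("es", "ic", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least four characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word


def variants(word, vocabulary):
    """Return every entry of the sorted vocabulary that starts with the crude stem."""
    prefix = crude_stem(word)
    # Binary search for the first entry that could start with the prefix
    start = bisect_left(vocabulary, prefix)
    matches = []
    for candidate in vocabulary[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(candidate)
    return matches


if __name__ == "__main__":
    print(variants("leukocyte", VOCABULARY))
```

Prefix matching is more permissive than comparing stems, so on a real corpus it will probably pull in some unrelated words as well; a real stemmer could be swapped in for `crude_stem` without changing the lookup.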
