Python - merge and split synonym word list


(I am trying to update a Hunspell spelling dictionary.) My synonym file looks like this:

    mylist="""
    specimen|3
    sample
    prototype
    example
    sample|3
    prototype
    example
    specimen
    prototype|3
    example
    specimen
    sample
    example|3
    specimen
    sample
    prototype
    protoype|1
    illustration
    """

The first step is to merge duplicate words. In the example above, the word "prototype" is repeated, so its two entries need to be clubbed together. The count changes from 3 to 4 because the synonym "illustration" is added.

    specimen|3
    sample
    prototype
    example
    sample|3
    prototype
    example
    specimen
    prototype|4
    example
    specimen
    sample
    illustration
    example|3
    specimen
    sample
    prototype
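A minimal sketch of this merge step, assuming each `word|count` header line is followed by one synonym per line and that a repeated headword is spelled identically each time. The name `merge_duplicates` is hypothetical and not part of Hunspell.

```python
def merge_duplicates(text):
    """Merge entries that share a headword, keeping first-seen order."""
    entries = {}  # headword -> list of synonyms without duplicates
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            # Header line such as "prototype|3"; the count is recomputed below.
            current = line.split("|")[0].strip()
            entries.setdefault(current, [])
        elif current is not None and line not in entries[current]:
            entries[current].append(line)
    return entries


sample = """
prototype|3
example
specimen
sample
prototype|1
illustration
"""

for word, synonyms in merge_duplicates(sample).items():
    print("%s|%d" % (word, len(synonyms)))
    for synonym in synonyms:
        print(synonym)
```

The count is derived from the merged list rather than summed from the headers, so a synonym listed under both entries is only counted once.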

The second step is more complicated. It is not enough to merge duplicates: the added word should also be reflected in the linked words. In this case I need to search for "prototype" in every synonym list and, if it is found, add the word "illustration" there too. The final list of words looks like this:

    specimen|4
    sample
    prototype
    example
    illustration
    sample|4
    prototype
    example
    specimen
    illustration
    prototype|4
    example
    specimen
    sample
    illustration
    example|4
    specimen
    sample
    prototype
    illustration

The new word "illustration" should also be added to the original list as an entry of its own, with its 4 linked words.

    illustration|4
    example
    specimen
    sample
    prototype

What I have tried:

    import io

    myfile = io.StringIO(mylist)
    for lineno, i in enumerate(myfile):
        if i:
            try:
                if int(i.split("|")[1]) > 0:
                    print(lineno, i.split("|")[0], int(i.split("|")[1]))
            except (IndexError, ValueError):
                pass  # not a "word|count" header line

The above code returns each headword with its line number and count.

    1 specimen 3
    5 sample 3
    9 prototype 3
    13 example 3
    17 protoype 1

It means I need to merge the 1 word on line number 18 into the entry found on line number 9 ("prototype"), at the 4th position. If I can do this, I will have completed step 1 of the task.

The problem you described is a classical union-find problem, which can be solved with a disjoint-set algorithm. Don't re-invent the wheel.

Read about union-find/disjoint set:

http://en.wikipedia.org/wiki/disjoint-set_data_structure

Or see these questions:

A set union find algorithm

Union find implementation using Python

    class DisjointSet(object):

        def __init__(self):
            self.leader = {}  # maps a member to the group's leader
            self.group = {}   # maps a group leader to the group (which is a set)

        def add(self, a, b):
            leadera = self.leader.get(a)
            leaderb = self.leader.get(b)
            if leadera is not None:
                if leaderb is not None:
                    if leadera == leaderb:
                        return  # nothing to do
                    groupa = self.group[leadera]
                    groupb = self.group[leaderb]
                    # always merge the smaller group into the larger one
                    if len(groupa) < len(groupb):
                        a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                    groupa |= groupb
                    del self.group[leaderb]
                    for k in groupb:
                        self.leader[k] = leadera
                else:
                    self.group[leadera].add(b)
                    self.leader[b] = leadera
            else:
                if leaderb is not None:
                    self.group[leaderb].add(a)
                    self.leader[a] = leaderb
                else:
                    self.leader[a] = self.leader[b] = a
                    self.group[a] = set([a, b])

    mylist = """
    specimen|3
    sample
    prototype
    example
    sample|3
    prototype
    example
    specimen
    prototype|3
    example
    specimen
    sample
    example|3
    specimen
    sample
    prototype
    prototype|1
    illustration
    specimen|1
    cat
    happy|2
    glad
    cheerful
    """

    ds = DisjointSet()
    for line in mylist.strip().splitlines():
        line = line.strip()
        if '|' in line:
            node, _ = line.split('|')   # header line: remember the headword
        else:
            ds.add(node, line)          # synonym line: link it to the headword

    for _, g in ds.group.items():
        print(sorted(g))

Output:

    ['cat', 'example', 'illustration', 'prototype', 'sample', 'specimen']
    ['cheerful', 'glad', 'happy']
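To get from the groups back to the `word|count` layout the question asks for, each member of a group can be listed with every other member as its synonyms. The following is a rough sketch under that assumption. The names `expand_synonyms`, `find`, and `union` are illustrative, and it uses a small parent-pointer union-find instead of the class above so that it runs on its own.

```python
def expand_synonyms(text):
    """Return {word: sorted list of all other words in the same group}."""
    parent = {}

    def find(word):
        # Follow parent pointers to the root, shortening the path as we go.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    head = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "|" in line:
            head = line.split("|")[0].strip()
            find(head)  # register the headword even if it has no synonyms
        elif head is not None:
            union(head, line)

    # Collect the members of each group under the group's root.
    groups = {}
    for word in parent:
        groups.setdefault(find(word), []).append(word)

    return {
        word: sorted(w for w in members if w != word)
        for members in groups.values()
        for word in members
    }


sample = """
prototype|2
example
specimen
prototype|1
illustration
"""

for word, synonyms in sorted(expand_synonyms(sample).items()):
    print("%s|%d" % (word, len(synonyms)))
    for synonym in synonyms:
        print(synonym)
```

Note that this treats synonymy as transitive: any two words that end up in the same group become synonyms of each other, and the synonyms are emitted alphabetically rather than in their original file order.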

Dijkstra's algorithm can also solve your problem, but I think it's overkill, since you don't need the shortest distance between nodes; you only need the connected components of the graph.
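For comparison, the connected-components view can be sketched with a plain breadth-first search over an adjacency map, with no distances involved. This is only an illustration: `connected_components` and the edge list are made up for the example.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group nodes linked by undirected edges using breadth-first search."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Each pair links a headword to one of its synonyms.
edges = [
    ("prototype", "example"),
    ("prototype", "illustration"),
    ("specimen", "cat"),
    ("happy", "glad"),
    ("happy", "cheerful"),
]
for component in connected_components(edges):
    print(sorted(component))
```

Union-find has the advantage when pairs arrive incrementally, because groups are merged as each pair is read; the search above needs the whole graph first.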

