python - merge and split synonym word list -
(i trying update hunspell spelling dictionary) synonym file looks this...
mylist=""" specimen|3  sample prototype example sample|3 prototype example specimen prototype|3 example specimen sample example|3  specimen sample prototype protoype|1 illustration """ the first step merge duplicate words. in example mentioned above, word "prototype" repeated. need club together. count change 3 4 because "illustration" synonym added.
specimen|3  sample prototype example sample|3 prototype example specimen prototype|4 example specimen sample illustration example|3  specimen sample prototype the second step more complicated. not enough merge duplicates. added word should reflected linked words. in case need search "prototype" in synonym list , if found, "illustration" word should added. final list of words this...
specimen|4 sample prototype example illustration sample|4 prototype example specimen illustration prototype|4 example specimen sample illustration example|4  specimen sample prototype illustration a new word "illustration" should added original list 4 linked words.
illustration|4 example specimen sample prototype what have tried:
myfile=stringio.stringio(mylist) lineno, in enumerate(myfile):     if i:         try:             if int(i.split("|")[1]) > 0:                 print lineno, i.split("|")[0], int(i.split("|")[1])         except:             pass the above code returns word line numbers , count.
1 specimen 3 5 sample 3 9 prototype 3 13 example 3 17 protoype 1 it means need merge 1 word on line number 18 word found on line number 9 ("prototype") @ 4th position. if can this, complete step 1 of task.
the problem described classical union-find problem, can solved disjoint set algorithm. don't re-invent wheel.
read union-find/disjoint set:
http://en.wikipedia.org/wiki/disjoint-set_data_structure
or questions:
union find implementation using python
class disjointset(object): def __init__(self):     self.leader = {} # maps member group's leader     self.group = {} # maps group leader group (which set)  def add(self, a, b):     leadera = self.leader.get(a)     leaderb = self.leader.get(b)     if leadera not none:         if leaderb not none:             if leadera == leaderb: return # nothing             groupa = self.group[leadera]             groupb = self.group[leaderb]             if len(groupa) < len(groupb):                 a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa             groupa |= groupb             del self.group[leaderb]             k in groupb:                 self.leader[k] = leadera         else:             self.group[leadera].add(b)             self.leader[b] = leadera     else:         if leaderb not none:             self.group[leaderb].add(a)             self.leader[a] = leaderb         else:             self.leader[a] = self.leader[b] =             self.group[a] = set([a, b])  mylist=""" specimen|3  sample prototype example sample|3 prototype example specimen prototype|3 example specimen sample example|3  specimen sample prototype prototype|1 illustration specimen|1 cat happy|2 glad cheerful  """ ds = disjointset() line in mylist.strip().splitlines():     if '|' in line:          node, _ = line.split('|')     else:          ds.add(node, line)  _,g in ds.group.items():     print g  >>>  set(['specimen', 'illustration', 'cat', 'sample', 'prototype', 'example']) set(['cheerful', 'glad', 'happy']) using dijkstra algorithm can solve problem, think it's overkill, since don't need shortest distance between nodes, need connected components in graph.
Comments
Post a Comment