python - merge and split synonym word list -
(i trying update hunspell spelling dictionary) synonym file looks this...
mylist=""" specimen|3 sample prototype example sample|3 prototype example specimen prototype|3 example specimen sample example|3 specimen sample prototype protoype|1 illustration """
the first step merge duplicate words. in example mentioned above, word "prototype" repeated. need club together. count change 3 4 because "illustration" synonym added.
specimen|3 sample prototype example sample|3 prototype example specimen prototype|4 example specimen sample illustration example|3 specimen sample prototype
the second step more complicated. not enough merge duplicates. added word should reflected linked words. in case need search "prototype" in synonym list , if found, "illustration" word should added. final list of words this...
specimen|4 sample prototype example illustration sample|4 prototype example specimen illustration prototype|4 example specimen sample illustration example|4 specimen sample prototype illustration
a new word "illustration" should added original list 4 linked words.
illustration|4 example specimen sample prototype
what have tried:
myfile=stringio.stringio(mylist) lineno, in enumerate(myfile): if i: try: if int(i.split("|")[1]) > 0: print lineno, i.split("|")[0], int(i.split("|")[1]) except: pass
the above code returns word line numbers , count.
1 specimen 3 5 sample 3 9 prototype 3 13 example 3 17 protoype 1
it means need merge 1 word on line number 18 word found on line number 9 ("prototype") @ 4th position. if can this, complete step 1 of task.
the problem described classical union-find problem, can solved disjoint set algorithm. don't re-invent wheel.
read union-find/disjoint set:
http://en.wikipedia.org/wiki/disjoint-set_data_structure
or questions:
union find implementation using python
class disjointset(object): def __init__(self): self.leader = {} # maps member group's leader self.group = {} # maps group leader group (which set) def add(self, a, b): leadera = self.leader.get(a) leaderb = self.leader.get(b) if leadera not none: if leaderb not none: if leadera == leaderb: return # nothing groupa = self.group[leadera] groupb = self.group[leaderb] if len(groupa) < len(groupb): a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa groupa |= groupb del self.group[leaderb] k in groupb: self.leader[k] = leadera else: self.group[leadera].add(b) self.leader[b] = leadera else: if leaderb not none: self.group[leaderb].add(a) self.leader[a] = leaderb else: self.leader[a] = self.leader[b] = self.group[a] = set([a, b]) mylist=""" specimen|3 sample prototype example sample|3 prototype example specimen prototype|3 example specimen sample example|3 specimen sample prototype prototype|1 illustration specimen|1 cat happy|2 glad cheerful """ ds = disjointset() line in mylist.strip().splitlines(): if '|' in line: node, _ = line.split('|') else: ds.add(node, line) _,g in ds.group.items(): print g >>> set(['specimen', 'illustration', 'cat', 'sample', 'prototype', 'example']) set(['cheerful', 'glad', 'happy'])
using dijkstra algorithm can solve problem, think it's overkill, since don't need shortest distance between nodes, need connected components in graph.
Comments
Post a Comment