It looks long but it does the work:
ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
chunked, prev_tag = [], ""
for word_pos in ner_output:
    word, pos = word_pos
    # Merge with the previous chunk when the entity tag repeats.
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos
clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]]) if len(wordpos)!=2 else wordpos for wordpos in chunked]
print(clean_chunked)
[out]:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]
For more details:
The first for-loop "with memory" achieves something like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]
You'll notice that every multi-word named entity now has more than 2 items in its tuple, and what you want are the words, i.e. 'Republican' and 'Party' in (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'). The words sit at the even indices, so you can slice with a step of 2 to get them:
>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]
The last element in the NE tuple is the tag you want, so you would do:
>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'
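Putting the two slices together gives the cleaned tuple. Here is a minimal sketch of that step on its own; the helper name `merge_entity` is made up for illustration and is not part of the code above:

```python
def merge_entity(wordpos):
    """Collapse a flat (word, tag, word, tag, ...) tuple into (phrase, tag)."""
    if len(wordpos) == 2:
        return wordpos       # single token, nothing to merge
    words = wordpos[::2]     # even indices hold the words
    tag = wordpos[-1]        # the last item is the shared tag
    return (" ".join(words), tag)

merged = merge_entity((u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'))
```

This is the same thing the list comprehension does for each item in `chunked`, just spelled out.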
It's a little ad-hoc and verbose but I hope it helps. And here it is in a function, Blessed Christmas:
ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
def rechunk(ner_output):
    chunked, prev_tag = [], ""
    for word_pos in ner_output:
        word, pos = word_pos
        # Merge with the previous chunk when the entity tag repeats.
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos
    clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]])
                     if len(wordpos)!=2 else wordpos for wordpos in chunked]
    return clean_chunked

print(rechunk(ner_output))
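As an aside, the same "merge consecutive tokens with the same tag" idea can probably be written more compactly with `itertools.groupby` from the standard library. This is an alternative sketch, not the approach above; `rechunk_groupby` and `ENTITY_TAGS` are names made up for this example. Like the loop above, it merges any adjacent tokens that share a tag, so two different entities of the same type sitting right next to each other would be glued together.

```python
from itertools import groupby

# Tags whose consecutive tokens should be merged (same list as in the loop version).
ENTITY_TAGS = {'PERSON', 'ORGANIZATION', 'LOCATION'}

def rechunk_groupby(ner_output):
    """Merge runs of adjacent tokens that share the same entity tag."""
    chunked = []
    # groupby yields one group per run of consecutive items with an equal key.
    for tag, group in groupby(ner_output, key=lambda word_tag: word_tag[1]):
        words = [word for word, _ in group]
        if tag in ENTITY_TAGS:
            chunked.append((" ".join(words), tag))
        else:
            # Non-entity tokens (e.g. 'O') stay as separate items.
            chunked.extend((word, tag) for word in words)
    return chunked

sample = [(u'Remaking', u'O'), (u'The', u'O'),
          (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
result = rechunk_groupby(sample)
```

The trade-off is that there is no intermediate flat tuple to slice, so the `[::2]` trick is not needed.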