Following on from my previous post about NLTK Trees, here is a short Python function to extract phrases from an NLTK Tree structure.
Recently I needed to extract noun phrases from a section of text, in an attempt to choose “interesting concept phrases”. N-gram collocations are a common way of doing this, but they also produced partial phrases that defined a concept poorly. Most of the phrases I was interested in were noun phrases, so I chose to tag and chunk the text instead. The noun phrases (tagged ‘NP’) were then extracted from the chunked tree structures.
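For readers who have not chunked text before, here is a minimal sketch of that tag-and-chunk step. The tokens are tagged by hand so that no tagger model needs to be downloaded, and the chunk grammar is only an illustrative assumption, not the one I used in my application.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# Hand-tagged tokens stand in for the output of a POS tagger
# (assumption: Penn Treebank style tags).
tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# Illustrative grammar: optional determiner, any adjectives, then one or more nouns.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
chunked = chunker.parse(tagged)  # an nltk Tree whose chunks are 'NP' sub-trees

# This chunk tree is flat, so the NP chunks are direct children of the root.
# Leaves of a chunk tree are (word, tag) pairs; keep only the words.
noun_phrases = [" ".join(word for word, _tag in child.leaves())
                for child in chunked if isinstance(child, Tree)]
print(noun_phrases)
```

A flat tree like this one can be read off with a single loop. Parsed trees can nest phrases inside other phrases, which is why the extraction function is recursive.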
Here is the code for the extraction function:
    from nltk.tree import Tree  # Tree manipulation

    # Extract phrases from a parsed (chunked) tree
    # phrase = tag of the phrases (sub-trees) to extract
    # Returns: list of deep copies; recursive
    def ExtractPhrases(myTree, phrase):
        myPhrases = []
        # NLTK 3 exposes the tag through label(); older releases used .node
        if myTree.label() == phrase:
            myPhrases.append(myTree.copy(True))
        for child in myTree:
            # Leaves are plain strings; only sub-trees are searched further
            if isinstance(child, Tree):
                myPhrases.extend(ExtractPhrases(child, phrase))
        return myPhrases
This function walks the tree recursively, finds every sub-tree whose tag matches (‘NP’ in my application), and returns a list of deep copies of those sub-trees.
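Two consequences of that design are easy to miss: nested phrases are all returned (outer phrase first), and because the results are deep copies, editing one does not touch the original tree. The sketch below illustrates both. `extract_phrases` is a hypothetical compact restatement of the function above, included only so the block runs on its own, and it assumes the NLTK 3 API.

```python
from nltk.tree import Tree


def extract_phrases(tree, label):
    # Compact restatement of the recursive function above (hypothetical name).
    found = [tree.copy(deep=True)] if tree.label() == label else []
    for child in tree:
        if isinstance(child, Tree):
            found.extend(extract_phrases(child, label))
    return found


# The outer NP contains two inner NPs.
tree = Tree.fromstring(
    "(S (NP (NP the cat) (PP (P on) (NP the mat))) (VP (V slept)))")
phrases = extract_phrases(tree, "NP")

# Pre-order traversal: the outer NP comes first, then the NPs nested inside it.
leaves = [" ".join(p.leaves()) for p in phrases]
print(leaves)

# Editing one deep copy leaves the original tree and the other copies untouched.
phrases[1][0] = "a"
print(tree.leaves()[0], phrases[0].leaves()[0], phrases[1].leaves()[0])
```

If you only want the outermost phrases, you would stop recursing once a match is found; the function in this post deliberately keeps going.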
Here is an example of the function’s usage:
    # Tree.fromstring replaces the older Tree.parse in NLTK 3
    test = Tree.fromstring('(S (NP I) (VP (V enjoyed) (NP my cookies)))')
    print("Input tree:", test)
    print("\nNoun phrases:")
    list_of_noun_phrases = ExtractPhrases(test, 'NP')
    for phrase in list_of_noun_phrases:
        print(" ", phrase)
It is a simple demonstration of how easily the Tree structure can be processed with short functions.
It is written in a very procedural manner and is neither very functional nor Pythonic. Perhaps you could write a more elegant and Pythonic version?
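One possible answer, offered as a sketch rather than the definitive version: NLTK's `Tree.subtrees()` method already walks the tree and accepts a `filter` function, so the recursion can be replaced by a small generator. The name `iter_phrases` is my own invention for this example, and the code assumes the NLTK 3 API.

```python
from nltk.tree import Tree


def iter_phrases(tree, label):
    """Yield a deep copy of every sub-tree whose label equals `label`."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        yield subtree.copy(deep=True)


test = Tree.fromstring("(S (NP I) (VP (V enjoyed) (NP my cookies)))")
noun_phrases = [" ".join(np.leaves()) for np in iter_phrases(test, "NP")]
print(noun_phrases)
```

Being a generator, it produces phrases lazily; wrap the call in `list()` if you need them all at once.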