A number of NLTK functions work with Tree objects. For example, part-of-speech tagging and chunking classifiers naturally return trees, and sentence manipulation functions also work with trees. Although Natural Language Processing with Python (Bird et al.) includes a couple of pages about NLTK’s Tree module, its coverage is sparse. The online documentation is more thorough, although it is not always in the most logical location (e.g. the unit tests contain some very good documentation). This article is intended as a quick introduction; the more informative documentation pages are listed under Further Reading.
NLTK’s tree module implements the Tree class. Display extensions are also available in the nltk.draw module. The samples in this article assume the following imports:
from nltk.tree import *
from nltk.draw import tree
A Tree object consists of a node value (typically a string label) and a Python iterable containing the node’s children. The iterable can be any Python iterable except a string, but it is typically a list. The node’s children can be of any type, but they are typically leaf labels (i.e. strings) or Tree objects. A hierarchical ‘tree’ structure is produced by nesting Tree objects.
Here are a couple of examples from the unit tests:
>>> print(Tree(1, [2, 3, 4]))
(1 2 3 4)
>>> s = Tree('S', [Tree('NP', ['I']), Tree('VP', [Tree('V', ['saw']), Tree('NP', ['him'])])])
>>> print(s)
(S (NP I) (VP (V saw) (NP him)))
Note that the first example has three ‘leaves’ and uses numeric labels instead of the more usual string labels. The second example is more conventional and uses string labels in nested Tree objects.
As Tree objects are simply Python objects, it is possible to manipulate sub-trees to form larger trees. For example:
>>> dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])
>>> dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])
>>> vp = Tree('vp', [Tree('v', ['chased']), dp2])
>>> sentence = Tree('s', [dp1, vp])
>>> print(sentence)
(s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))
The draw extension can be used to render the tree as a drawing:
>>> sentence.draw()
produces a drawing of the tree in a new window.
Children and nodes can both be accessed or modified in situ. For example:
>>> print(sentence[1][1])
(dp (d the) (np cat))
>>> sentence[0], sentence[1][1] = sentence[1][1], sentence[0]
>>> print(sentence)
(s (dp (d the) (np cat)) (vp (v chased) (dp (d the) (np dog))))
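The example above swaps whole sub-trees; node labels and individual leaves can be changed in the same way. The following is a minimal sketch, assuming NLTK 3 on Python 3, where a node's label is read with label() and changed with set_label() (older releases exposed it as a node attribute instead). The replacement label 'verb' and the word 'followed' are made up for illustration.

```python
from nltk.tree import Tree

sentence = Tree('s', [
    Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])]),
    Tree('vp', [Tree('v', ['chased']),
                Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])]),
])

# Read the label of the root and of a nested sub-tree.
root_label = sentence.label()
object_label = sentence[1][1].label()

# A tuple index walks down the tree in one step: s -> vp -> dp -> np.
noun = sentence[1, 1, 1]

# Relabel a node and replace a leaf, both in place.
sentence[1][0].set_label('verb')
sentence[1][0][0] = 'followed'
print(sentence)
```

Because these operations modify the existing objects, any other tree that shares the same sub-tree object will see the change as well.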
A tree can also be ‘flattened’ to just its leaves:
>>> print(sentence.leaves())
['the', 'cat', 'chased', 'the', 'dog']
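leaves() is one of several read-only views of a tree. As a rough sketch, assuming NLTK 3 on Python 3, the snippet below also uses pos(), subtrees() and height(); the tree is built with Tree.fromstring(), which in NLTK 3 parses the bracketed notation that print produces.

```python
from nltk.tree import Tree

sentence = Tree.fromstring(
    '(s (dp (d the) (np cat)) (vp (v chased) (dp (d the) (np dog))))'
)

# Just the words, left to right.
words = sentence.leaves()

# Each word paired with the label of the node directly above it.
tagged = sentence.pos()

# Every sub-tree whose label is 'dp', found by filtering a traversal.
dp_subtrees = list(sentence.subtrees(lambda t: t.label() == 'dp'))

# Number of levels in the tree, counting the leaves as one level.
depth = sentence.height()

print(words)
print(tagged)
print(len(dp_subtrees), depth)
```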
The Tree class also provides a number of other tree-manipulation methods, including a parse() method (renamed Tree.fromstring() in NLTK 3) that builds a tree from its string representation (i.e. the reverse of print). These are detailed further in the pages referenced below.
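To illustrate parsing a string representation, here is a short sketch that assumes NLTK 3 on Python 3. In that release the parser is exposed as Tree.fromstring(); older releases called it parse(). The sentence is the one from the unit-test example earlier in this article.

```python
from nltk.tree import Tree

text = '(S (NP I) (VP (V saw) (NP him)))'

# Build the tree from its bracketed string form.
parsed = Tree.fromstring(text)

# Build the same tree by nesting constructors, for comparison.
built = Tree('S', [Tree('NP', ['I']),
                   Tree('VP', [Tree('V', ['saw']), Tree('NP', ['him'])])])

# Trees compare equal when their labels and children match.
same = (parsed == built)

# Converting back to a string and parsing again gives an equal tree.
round_trip = Tree.fromstring(str(parsed))

print(same, round_trip == parsed)
```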
As well as the nltk.draw module extension, NLTK provides a number of Tree extensions in the nltk.treetransforms module. These extensions transform existing Tree objects in a number of standard ways. For example, chomsky_normal_form() converts a tree to ‘Chomsky Normal Form’. Also known as binarization, this transform introduces intermediate nodes so that no node has more than two children. collapse_unary() collapses a node that has only one child into that child, joining their labels. The module also contains a Markov smoothing transform.
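The two transforms can be tried on small hand-made trees. This is a sketch under the assumption of NLTK 3 on Python 3, where the transforms are also available as Tree methods that modify the tree in place; the labels A to D and X are invented for the example.

```python
from nltk.tree import Tree

# A flat node with four children, so binarization has something to do.
flat = Tree.fromstring('(S (A a) (B b) (C c) (D d))')
flat.chomsky_normal_form()  # modifies the tree in place
print(flat)

# X -> NP is a single-child chain above a branching node, so it is collapsed.
# VP -> V is left alone by default because V sits directly above a word.
unary = Tree.fromstring('(S (X (NP (N John) (N Smith))) (VP (V sleeps)))')
unary.collapse_unary()  # modifies the tree in place
print(unary)
```

Both methods accept optional arguments (for example the binarization direction, or whether nodes directly above words are collapsed); the defaults are used here.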
Further Reading
The best overview documentation for the nltk.tree.Tree class can be found under the unit tests at: http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html .
The official documentation for the nltk.treetransforms module can be found at: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.treetransforms-module.html .