Now that we can segment words and sentences, it is possible to produce word and tuple frequency tables. Here I show you how to create a word frequency table for a large collection of text files.
Word frequency tables are useful for a wide range of applications, including collocation detection, spelling correction, and n-gram modeling. This article concentrates on simple word frequencies, but the code can (and will) be extended to also calculate n-grams.
Applications that require word frequency tables need frequencies that are representative of the problem domain. Typically the domain is a subset of texts (e.g. news articles), but it is frequently much broader (e.g. English literature). Either way, the sample text is probably going to be large, consisting of many (possibly thousands or tens of thousands of) texts. The script below was written to create a word frequency table for all of the text files in a directory. It has been used with both the entire English-language Gutenberg library and the entire set of English Wikipedia pages. Preparing these texts will be covered in future articles.
Frequency tables are stored using NLTK’s FreqDist class. This derives from the standard Python dictionary, storing a count for each word (the key). This allows the resulting table to be used directly by NLTK, if so desired.
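For example, a FreqDist can be populated and queried much like an ordinary dictionary. This snippet assumes the same (older) NLTK version used by the script below, where counts are incremented with inc():

from nltk.probability import FreqDist

fd = FreqDist()
for w in ["the", "cat", "sat", "on", "the", "mat"]:
    fd.inc(w)        # increment the count for this word

print fd["the"]      # count for "the": 2
print fd.B()         # number of distinct words (bins): 5
print fd.N()         # total number of samples counted: 6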
Words are segmented using my own word and sentence segmenter. Punctuation and numbers are dropped from the word counts, but these could easily be included if your application requires them. Acronyms and abbreviations are counted, even if they contain numeric digits. E.g. “1970” is dropped because it is purely a number, but “1970s” (an abbreviation) is kept.
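The dropping is performed with a single regular expression (taken from the code below) that matches tokens consisting entirely of digits, punctuation, underscores, and the letters ‘e’/‘E’ (the latter presumably so that numbers written in scientific notation are also caught). A quick demonstration:

import re

# The drop pattern from the class below: tokens made up entirely of
# digits, non-word characters, underscores, and 'e'/'E'
regexDrop = re.compile(r'^[\d\WeE\_]+$')

for token in ["1970", "1970s", "3.5e10", "...", "IBM"]:
    if regexDrop.match(token):
        print token, "=> dropped"
    else:
        print token, "=> counted"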
Here is the code:
# Module of functions for calculating word frequency tables
# These can work on in-memory text, a text file, or a directory
# Also includes functions for combining tables, etc.
# Originally written to create word tables for the Gutenberg and
# Wikipedia corpora

import string
import sys
import os
import re
import gc
import shutil

# Import the SentenceTokenizer (in module word_parser.py)
from word_parser import SentenceTokenizer

from nltk.probability import FreqDist


class WordFrequencyBuilder(object):

    def __init__(self, diagnostics):
        self.diagnostics = diagnostics
        # Create one sentence tokenizer for re-use as required
        self.myTokenizer = SentenceTokenizer()
        # Create an empty frequency distribution
        self.myFD = FreqDist()
        # Regex for 'words' to drop (numbers/punctuation)
        self.regexDrop = re.compile(r'^[\d\WeE\_]+$')

    # Accessors

    # Return a reference to the frequency distribution
    # This is an NLTK FreqDist object
    def FD(self):
        return self.myFD

    # Used by buildTableForFile() to process a section of text for word counting
    # This text is assumed to be a paragraph
    # text: Text to process
    def processText(self, text):
        # Segment the text into words and sentences
        # Only words are required, but sentence segmentation is involved
        # because we want to interpret full stops correctly
        sentences = self.myTokenizer.segment_text(text)
        for sentence in sentences:
            for word in sentence:
                if not self.regexDrop.match(word):
                    self.myFD.inc(word.lower().strip())

    # Add the words for a file to this frequency table
    # fname: Full path of the file to read (plain text only)
    # Note: The distribution is cumulative. Create a new object
    # to reset the table
    def buildTableForFile(self, fname):
        # Read the text as a list of lines
        f = open(fname, "r")
        lines_text = f.readlines()
        f.close()

        # Process text in paragraphs, using empty lines as paragraph markers
        # This avoids inefficient text processing and re-allocations
        full_text = ""
        for s in lines_text:
            ss = s.strip()
            if len(ss) == 0 and len(full_text) > 0:
                # Empty line => process what we have
                self.processText(full_text)
                full_text = ""
            else:
                # Accumulate this line
                full_text = full_text + " " + ss
        # Process any remaining text
        if len(full_text) > 0:
            self.processText(full_text)

    # Add the words for all files in the supplied directory to this
    # frequency table. The directory is recursed if necessary
    # All files should be plain text.
    # path: Full path of the directory to read
    # Returns a reference to our frequency distribution
    # Note: The distribution is cumulative. Create a new object
    # to reset the table
    def buildTableForTextDir(self, path):
        counter = 0
        for dirname, dirnames, filenames in os.walk(path):
            for f in filenames:
                infile = os.path.join(dirname, f)
                self.buildTableForFile(infile)
                counter = counter + 1
                if (counter % 100) == 0:
                    if self.diagnostics:
                        print counter, ": B=", self.myFD.B(), " N=", self.myFD.N()
                    gc.collect()
        return self.myFD


# Main script - create frequency table for the supplied path
# Usage: python word_freqs.py /my/input/path /my/output/table.txt
if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.stderr.write("Usage: python %s inputpath outputfile\n" % sys.argv[0])
        raise SystemExit(1)
    input_path = sys.argv[1]
    output_file = sys.argv[2]

    print "Scanning word frequencies..."
    myWF = WordFrequencyBuilder(True)
    fd = myWF.buildTableForTextDir(input_path)

    print "No. of different words = ", myWF.FD().B(), '(samples=', myWF.FD().N(), ')'
    print "Writing to file..."
    f = open(output_file, "w")
    f.write("ALL\t{0}\n".format(fd.N()))
    for word in fd.keys():
        f.write("{0}\t{1}\n".format(word, fd[word]))
    f.close()
Note that this code includes a number of diagnostics to indicate the current progress. These could be removed for silent running if you are confident that it is running okay.
Also, it can be used as a module (by importing it and creating a WordFrequencyBuilder object), or it can be run as a script from the command line. In the latter form, it takes two command line parameters: the source path and the output file. Output is a tab-separated file with one word per line. Tabs are used as the delimiter in case punctuation tokens are ever included; this avoids having to handle quote-escape sequences. The (lowercase) word is in the first column, followed by its count. Words are sorted with the most frequent first. The first line (with the pseudo-word “ALL”) gives the total number of samples counted.
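If you later need to load the table back, a few lines of Python will do it. Here is a minimal sketch based on the format just described (read_freq_table is a hypothetical helper name, not part of the script above):

# Hypothetical helper to read a table written by the script above.
# Returns the total sample count and a dict of word counts.
def read_freq_table(fname):
    total = 0
    counts = {}
    f = open(fname, "r")
    for line in f:
        word, count = line.rstrip("\n").split("\t")
        if word == "ALL":
            total = int(count)    # first line: total number of samples
        else:
            counts[word] = int(count)
    f.close()
    return total, counts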
The FreqDist dictionary in the WordFrequencyBuilder class is cumulative, i.e. multiple calls to buildTableForTextDir() on the same WordFrequencyBuilder object can be used to accumulate word frequencies across multiple directories.
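For example (the module name follows the usage comment in the script, and the directory paths here are purely illustrative):

from word_freqs import WordFrequencyBuilder

# Accumulate counts from two corpora into a single table
myWF = WordFrequencyBuilder(True)
myWF.buildTableForTextDir("/corpora/gutenberg")
fd = myWF.buildTableForTextDir("/corpora/wikipedia")
print fd.B(), fd.N()    # combined vocabulary size and total sample count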