Previously, I showed you how to create N-gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English-language Wikipedia and Gutenberg corpora, memory constraints restricted those scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables for these large datasets.
Large datasets such as the Wikipedia and Gutenberg English-language corpora cannot be turned into N-gram frequency tables with the previous script because of its large in-memory requirements. The solution is to build the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal: it is an open-source "NoSQL" library that provides a disk-based database with in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and it also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package; Python 2 bundles the older bsddb bindings in its standard library, but bsddb3 itself is a separate download, and it is the only option for Python 3 installations.
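Before diving into the script, here is a minimal sketch of what BerkeleyDB in key-value mode looks like through bsddb3 (the file name is illustrative, and the raw API stores plain byte strings; the script below pickles its integer counts instead):

from bsddb3 import db

# Minimal key-value usage sketch (illustrative file name "demo.bdb")
table = db.DB()
table.set_cachesize(0, 64 * 1024 * 1024)    # 64MB in-memory cache, set before open()
table.open("demo.bdb", None, db.DB_HASH, db.DB_CREATE)

table.put("hello", "1")                     # keys and values are byte strings
print table.get("hello")                    # prints "1"

table.sync()                                # flush the cache to disk
table.close()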
Here is the modified script that uses BerkeleyDB. First we have the various imports:
# Module of functions for calculating N-Gram frequency tables
# Originally written to create word tables for the Gutenberg and Wikipedia Corpora
# Due to their excessive size, output is to a BerkeleyDB

import string
import sys
import os
import re
import gc
import shutil

# Berkeley DB I/O
from bsddb3 import db
import cPickle as pickle

# Import the SentenceTokenizer (in module word_parser.py)
from word_parser import SentenceTokenizer

from nltk.probability import FreqDist
Next, we define a class, FreqTableDB, which encapsulates the BerkeleyDB key-value table and implements a frequency table. In the constructor, the BerkeleyDB in-memory cache is set to 1GB; you can set this larger if you have sufficient memory for all processes (including the remainder of this script). The class also exposes the BerkeleyDB cursor to enable simple iteration through the data.
Also of note is the calculate_total() method, which computes the total number of samples by iterating through the entire database. As well as being returned, the total is saved under a special key, "__total__". This key can be read back quickly instead of calling calculate_total() every time the total sample count is required (see the short usage sketch after the class code below).
Here is the FreqTableDB code:
class FreqTableDB( object ):

    # Open file for read or write/append
    # Caller should delete the file, if a new db is required
    def __init__ (self, fname, bAppend):
        self.bAppend = bAppend          # Open for write/append?
        self.filename = fname
        self.dbTable = db.DB()
        self.dbTable.set_cachesize(1, 0)    # 1GB cache
        if (bAppend):
            self.dbTable.open(fname, None, db.DB_HASH,
                              db.DB_DIRTY_READ | db.DB_CREATE )
        else:
            self.dbTable.open(fname, None, db.DB_HASH, db.DB_DIRTY_READ )

    # Close this database
    def close(self):
        if (self.dbTable is not None):
            self.dbTable.close()
            self.dbTable = None

    # Flush all cached changes to disk
    def flush(self):
        if (self.bAppend):
            self.dbTable.sync()

    # Increment the count for a particular word
    def increment(self, word, inc=1):
        if (self.bAppend and self.dbTable is not None):
            v = self.get(word)
            pk = pickle.dumps(v + inc, pickle.HIGHEST_PROTOCOL)
            self.dbTable.put(word, pk )

    # Query the count for a particular word
    # "not found" implies a value of zero
    def get(self, word):
        if (self.dbTable is not None):
            pk = self.dbTable.get( word )
            if (pk is not None):
                return pickle.loads(pk)
        return 0

    # Fetch a cursor (standard bsddb3 cursor)
    # Cursor should be read with cursor_key, cursor_value
    def cursor(self):
        if (self.dbTable is not None):
            return self.dbTable.cursor()
        return None

    # Fetch the key (word) for the current cursor tuple ( cursor.next() )
    def cursor_key(self, cursor_tuple):
        if (cursor_tuple is not None):
            return cursor_tuple[0]
        return None

    # Fetch the value (count) for the current cursor tuple
    def cursor_value(self, cursor_tuple):
        if (cursor_tuple is not None):
            return pickle.loads( cursor_tuple[1] )
        return None

    # Calculate the total count for the entire table
    # Count is returned, and also saved as "__total__"
    def calculate_total(self):
        if (self.dbTable is not None):
            cursor = self.cursor()
            rec = cursor.first()
            nTotal = 0L
            while rec:
                if (rec[0] != '__total__'):
                    nTotal = nTotal + self.cursor_value(rec)
                rec = cursor.next()
            pk = pickle.dumps(nTotal, pickle.HIGHEST_PROTOCOL)
            self.dbTable.put('__total__', pk )
            return nTotal
        return 0
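As a quick illustration (a sketch only, assuming a table file such as unigrams.bdb has already been built by the script below), the class can be opened read-only, queried, iterated via its cursor, and the cached "__total__" key read back without a full scan:

# Read-only usage sketch of FreqTableDB (file name "unigrams.bdb" is illustrative)
table = FreqTableDB("unigrams.bdb", False)              # bAppend=False => read-only

print "Count for 'the':", table.get("the")
print "Cached total samples:", table.get("__total__")   # avoids a full table scan

# Iterate the first few entries using the raw bsddb3 cursor
cursor = table.cursor()
rec = cursor.first()
cc = 0
while rec and (cc < 5):
    print table.cursor_key(rec), table.cursor_value(rec)
    rec = cursor.next()
    cc = cc + 1

table.close()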
Next, we modify the WordFrequencyBuilder class to use the FreqTableDB class above. Despite BerkeleyDB's own in-memory caching, it is generally quicker to build small in-memory frequency tables (NLTK FreqDist objects) for groups of input files and then write these to the database periodically. This batching greatly reduces the number of BerkeleyDB updates required.
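In outline, the batching pattern looks like this (a sketch only; the token list and file name are illustrative, and the real implementation is the buildTableForTextDir() method further below):

# Accumulate counts in an in-memory FreqDist, then push them to disk in one pass
from nltk.probability import FreqDist

local_fd = FreqDist()
for token in ["the", "cat", "sat", "the"]:
    local_fd.inc(token)                         # NLTK 2.x API, as used throughout this script

table = FreqTableDB("example_counts.bdb", True)     # open for append
for key in local_fd:
    table.increment(key, local_fd[key])             # one DB update per distinct key
table.flush()
table.close()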
The builder class has also had some minor modifications regarding the treatment of numbers and punctuation. Here is the updated WordFrequencyBuilder class:
class WordFrequencyBuilder(object):

    # n = Length of N-Gram
    def __init__ (self, n, fname):
        self.N = n

        # Create one sentence tokenizer for re-use as required
        self.myTokenizer = SentenceTokenizer()

        # Create an empty (local cache) frequency distribution
        self.myFD = FreqDist()

        # Create the master (disk-based) frequency table
        self.myDB = FreqTableDB(fname, True)

        # regex for punctuation (breaks N-grams)
        self.regexPunct = re.compile(r'^[\W_]+$')
        # regex for numbers - includes partial word/numbers
        self.regexNum = re.compile(r'\d+')
        self.regexASCIIPrintable = re.compile(r'^[\x20-\x7E]+$')
        self.regexStripL = re.compile(r'^[\W_]+')
        self.regexStripR = re.compile(r'[\W_]+$')
        self.regexAllUnderscores = re.compile(r'^_+$')

    def close(self):
        self.myDB.flush()
        self.myDB.close()

    # Accessors
    def DB(self):
        return self.myDB

    # Utility functions

    # Strip any spaces or punctuation prefixes/suffixes
    # Note: If word is JUST punctuation, then it is passed through as-is,
    # except all-underscore tokens, which are converted to an empty string
    def strip_word(self, word):
        wd = word.strip()
        if (self.regexAllUnderscores.match(wd)):
            return ''       # empty string - ignore all underscores
        if (self.regexPunct.match(wd) ):
            return wd       # pure punctuation - no change
        wd = self.regexStripL.sub( '', wd )
        wd = self.regexStripR.sub( '', wd )
        return wd

    # Used by buildTableForFile() to process a section of text for an N-gram
    # Note: punctuation always breaks an N-gram
    # Text is assumed to be a paragraph
    #   text: Text to process
    #   incl_num, incl_punct: Should numbers or punctuation be included?
    # If N>1, sentence start/end is recorded as '<s>' and '</s>'
    # If N=1, <s> and </s> are skipped
    def processText(self, text, incl_num, incl_punct):
        # Segment the text into words and sentences
        # Only words are required, but sentence segmentation is involved
        # because we want to interpret full stops correctly
        sentences = self.myTokenizer.segment_text(text)

        for sentence in sentences:
            n_gram = ['<s>']
            for word in sentence:
                wd = self.strip_word(word.lower())
                if (len(wd) == 0):
                    continue
                if (not self.regexASCIIPrintable.match(wd)):
                    # unrecognized word - reset the ngram
                    n_gram = [ ]
                    continue
                if (self.regexNum.match(wd) ):
                    # numeric
                    if (not incl_num):
                        n_gram = [ ]    # reset ngram
                        wd = ""
                        continue
                elif (self.regexPunct.match(wd) ):
                    # punctuation
                    if (not incl_punct):
                        # skip punctuation
                        n_gram = [ ]    # reset ngram
                        continue

                # if okay, add word (or symbol)
                n_gram.append( wd )

                # Shrink N-gram if it has grown too big
                if (len(n_gram) > self.N):
                    n_gram.pop(0)

                # save if big enough
                if (len(n_gram) == self.N):
                    self.myFD.inc( " ".join(n_gram) )

            # sentence finish - write a sentence marker if N>1
            if (self.N > 1):
                n_gram.append( '</s>' )
                if (len(n_gram) > self.N):
                    n_gram.pop(0)
                if (len(n_gram) == self.N):
                    self.myFD.inc( " ".join(n_gram) )

    # Add the words for a file to this frequency table
    #   fname: Full path file name to read (plain text only)
    #   incl_num, incl_punct: True if you wish to include numbers or
    #       symbol/punctuation tokens
    # Note: This distribution is accumulative. Create a new class
    # to reset the table
    def buildTableForFile(self, fname, incl_num, incl_punct):
        # Read the text as one big string
        f = open(fname, "r")
        lines_text = f.readlines()
        f.close()

        # Process text in paragraphs, using empty lines as paragraph markers
        # This avoids inefficient text processing and re-allocations
        full_text = ""
        for s in lines_text:
            ss = s.strip()
            if (len(ss) == 0 and len(full_text) > 0):
                # Empty line => process what we have
                self.processText(full_text, incl_num, incl_punct)
                full_text = ""
            else:
                # Accumulate this line
                full_text = full_text + " " + ss

        # Process any remaining text
        if (len(full_text) > 0):
            self.processText(full_text, incl_num, incl_punct)

    # Add the words for all files in the supplied directory to this
    # frequency table. The directory is recursed if necessary.
    # All files should be plain text.
    #   path: Full path to the directory to read
    #   incl_num, incl_punct: True if you wish to include numbers or
    #       symbol/punctuation tokens
    # Note: This distribution is accumulative. Create a new class
    # to reset the table
    def buildTableForTextDir(self, path, incl_num, incl_punct):
        counter = 0
        for dirname, dirnames, filenames in os.walk(path):
            for f in filenames:
                infile = os.path.join(dirname, f)
                self.buildTableForFile(infile, incl_num, incl_punct)

                counter = counter + 1
                if ( (counter % 100) == 0):
                    gc.collect()    # aggressively garbage collect
                    print counter

                if ( (counter % 1000) == 0):
                    # Copy the local FreqDist to the master database
                    print " (saving local table)"
                    for bigram in self.myFD:
                        bb = bigram.strip()
                        if (len(bb) > 0):
                            self.myDB.increment(bb, self.myFD[bigram] )
                    print " (flushing changes)"
                    self.myDB.flush()
                    self.myFD = FreqDist()
                    gc.collect()
                    print counter, " No. of keys:", self.myDB.dbTable.stat()["nkeys"]

        # All files finished - finalize any remaining in-memory data
        # Copy the local FreqDist to the master database
        print " (saving local table)"
        for bigram in self.myFD:
            bb = bigram.strip()
            if (len(bb) > 0):
                self.myDB.increment(bb, self.myFD[bigram] )
        print " (flushing changes)"
        self.myDB.flush()
        self.myFD = FreqDist()
        gc.collect()
        print counter, "Final: No. of keys:", self.myDB.dbTable.stat()["nkeys"]
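To make the sentence markers concrete, here is a small hedged example (the file name is illustrative, and the exact tokens depend on how SentenceTokenizer splits the text):

# Expected bigrams for a single short sentence, with punctuation included
builder = WordFrequencyBuilder(2, "demo_bigrams.bdb")
builder.processText("The cat sat.", False, True)    # incl_num=False, incl_punct=True
for ngram in builder.myFD:
    print ngram, builder.myFD[ngram]
# Assuming the sentence tokenizes to: the, cat, sat, .
# the local FreqDist should contain "<s> the", "the cat", "cat sat", "sat ." and ". </s>"
builder.close()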
Finally, we define the script's main entry point so that it can be run from the command line. This also serves as a usage example:
# Main script - create an N-Gram frequency table for the supplied path
# Usage: python build_ngram_db.py N /my/input/path /my/output/table.bdb
#   N = Size of ngram
if __name__ == '__main__':

    if (len(sys.argv) != 4):
        sys.stderr.write("Usage: python %s N inputpath outputfile\n" % sys.argv[0])
        raise SystemExit(1)

    Ngram = int(sys.argv[1])
    input_path = sys.argv[2]
    fname = sys.argv[3]
    print "NGram size: ", Ngram

    # Remove any existing table
    # Comment this out if you wish to append to an existing file
    if (os.path.exists(fname)):
        os.remove(fname)

    print "Scanning for NGrams..."
    myNGramWF = WordFrequencyBuilder(Ngram, fname)
    myNGramWF.buildTableForTextDir( input_path, False, True)
    myDB = myNGramWF.DB()

    # Display the first 200 entries as a simple demo
    cursor = myDB.cursor()
    rec = cursor.first()
    cc = 0
    while rec and (cc < 200):
        print ">" + myDB.cursor_key(rec) + "<->" + str(myDB.cursor_value(rec))
        rec = cursor.next()
        cc = cc + 1

    nTotal = myDB.calculate_total()
    print "Total samples=", nTotal

    myNGramWF.close()
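For example, assuming the corpus has been extracted into plain-text files under a (hypothetical) directory /data/corpus_text, a bigram table could be built with:

python build_ngram_db.py 2 /data/corpus_text bigrams.bdb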
And that is it! This script will still take days to process a large dataset such as the Wikipedia pages, but it will do so without running out of memory. If memory does become a problem, restrict the BerkeleyDB cache further and write the NLTK FreqDist cache out to the database more frequently (e.g. every 200 input files instead of every 1000).
The resulting BerkeleyDB databases are too large to make available as a simple download. Instead, I shall be making them available as an Azure web service in the next few weeks. Availability and demo code will be published here.