Previously, I showed you how to create N-gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English-language Wikipedia and Gutenberg corpora, memory constraints restricted those scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables for these large datasets.
Large datasets such as the Wikipedia and Gutenberg English-language corpora cannot be turned into N-gram frequency tables with the previous script because of its large in-memory requirements. The solution is to build the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal: it is an open-source "NoSQL" library that provides a disk-based database with in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and it also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package; Python 2 bundles the older bsddb bindings in its standard library, but bsddb3 itself is a separate download, and it is the only option for Python 3 installations.
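Before diving into the script, here is a minimal sketch of what BerkeleyDB in key-value mode looks like through bsddb3 (the file name is illustrative, and the raw API stores plain byte strings; the script below pickles its integer counts instead):

from bsddb3 import db

# Minimal key-value usage sketch (illustrative file name "demo.bdb")
table = db.DB()
table.set_cachesize(0, 64 * 1024 * 1024)    # 64MB in-memory cache, set before open()
table.open("demo.bdb", None, db.DB_HASH, db.DB_CREATE)

table.put("hello", "1")                     # keys and values are byte strings
print table.get("hello")                    # prints "1"

table.sync()                                # flush the cache to disk
table.close()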
Here is the modified script that uses BerkeleyDB. First we have the various imports:
# Module of functions for calculating N-Gram frequency tables
# Originally written to create word tables for the Gutenberg and Wikipedia Corpora
# Due to their excessive size, output is to a BerkeleyDB

import string
import sys
import os
import re
import gc
import shutil

# Berkeley DB I/O
from bsddb3 import db
import cPickle as pickle

# Import the SentenceTokenizer (in module word_parser.py)
from word_parser import SentenceTokenizer

from nltk.probability import FreqDist
Next, we define a class, FreqTableDB, which encapsulates the BerkeleyDB key-value table and implements a frequency table. In the constructor, the BerkeleyDB in-memory cache is set to 1GB; you can set this larger if you have sufficient memory for all processes (including the remainder of this script). The class also exposes the BerkeleyDB cursor to enable simple iteration through the data.
Also of note is the calculate_total() method, which computes the total number of samples by iterating through the entire database. As well as being returned, the total is saved under a special key, "__total__". This key can be read back quickly instead of calling calculate_total() every time the total sample count is required (see the short usage sketch after the class code below).
Here is the FreqTableDB code:
class FreqTableDB( object ):

    # Open file for read or write/append
    # Caller should delete the file, if a new db is required
    def __init__ (self, fname, bAppend):
        self.bAppend = bAppend          # Open for write/append?
        self.filename = fname
        self.dbTable = db.DB()
        self.dbTable.set_cachesize(1, 0)    # 1GB cache
        if (bAppend):
            self.dbTable.open(fname, None, db.DB_HASH,
                              db.DB_DIRTY_READ | db.DB_CREATE )
        else:
            self.dbTable.open(fname, None, db.DB_HASH, db.DB_DIRTY_READ )

    # Close this database
    def close(self):
        if (self.dbTable is not None):
            self.dbTable.close()
            self.dbTable = None

    # Flush all cached changes to disk
    def flush(self):
        if (self.bAppend):
            self.dbTable.sync()

    # Increment the count for a particular word
    def increment(self, word, inc=1):
        if (self.bAppend and self.dbTable is not None):
            v = self.get(word)
            pk = pickle.dumps(v + inc, pickle.HIGHEST_PROTOCOL)
            self.dbTable.put(word, pk )

    # Query the count for a particular word
    # "not found" implies a value of zero
    def get(self, word):
        if (self.dbTable is not None):
            pk = self.dbTable.get( word )
            if (pk is not None):
                return pickle.loads(pk)
        return 0

    # Fetch a cursor (standard bsddb3 cursor)
    # Cursor should be read with cursor_key, cursor_value
    def cursor(self):
        if (self.dbTable is not None):
            return self.dbTable.cursor()
        return None

    # Fetch the key (word) for the current cursor tuple ( cursor.next() )
    def cursor_key(self, cursor_tuple):
        if (cursor_tuple is not None):
            return cursor_tuple[0]
        return None

    # Fetch the value (count) for the current cursor tuple
    def cursor_value(self, cursor_tuple):
        if (cursor_tuple is not None):
            return pickle.loads( cursor_tuple[1] )
        return None

    # Calculate the total count for the entire table
    # Count is returned, and also saved as "__total__"
    def calculate_total(self):
        if (self.dbTable is not None):
            cursor = self.cursor()
            rec = cursor.first()
            nTotal = 0L
            while rec:
                if (rec[0] != '__total__'):
                    nTotal = nTotal + self.cursor_value(rec)
                rec = cursor.next()
            pk = pickle.dumps(nTotal, pickle.HIGHEST_PROTOCOL)
            self.dbTable.put('__total__', pk )
            return nTotal
        return 0
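As a quick illustration (a sketch only, assuming a table file such as unigrams.bdb has already been built by the script below), the class can be opened read-only, queried, iterated via its cursor, and the cached "__total__" key read back without a full scan:

# Read-only usage sketch of FreqTableDB (file name "unigrams.bdb" is illustrative)
table = FreqTableDB("unigrams.bdb", False)              # bAppend=False => read-only

print "Count for 'the':", table.get("the")
print "Cached total samples:", table.get("__total__")   # avoids a full table scan

# Iterate the first few entries using the raw bsddb3 cursor
cursor = table.cursor()
rec = cursor.first()
cc = 0
while rec and (cc < 5):
    print table.cursor_key(rec), table.cursor_value(rec)
    rec = cursor.next()
    cc = cc + 1

table.close()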
Next, we modify the WordFrequencyBuilder class to use the FreqTableDB class above. Despite BerkeleyDB's own in-memory caching, it is generally quicker to build small in-memory frequency tables (NLTK FreqDist objects) for groups of input files and then write these to the database periodically. This batching greatly reduces the number of BerkeleyDB updates required.
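In outline, the batching pattern looks like this (a sketch only; the token list and file name are illustrative, and the real implementation is the buildTableForTextDir() method further below):

# Accumulate counts in an in-memory FreqDist, then push them to disk in one pass
from nltk.probability import FreqDist

local_fd = FreqDist()
for token in ["the", "cat", "sat", "the"]:
    local_fd.inc(token)                         # NLTK 2.x API, as used throughout this script

table = FreqTableDB("example_counts.bdb", True)     # open for append
for key in local_fd:
    table.increment(key, local_fd[key])             # one DB update per distinct key
table.flush()
table.close()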
The builder class has also had some minor modifications regarding the treatment of numbers and punctuation. Here is the updated WordFrequencyBuilder class:
class WordFrequencyBuilder(object):

    # n = Length of N-Gram
    def __init__ (self, n, fname):
        self.N = n

        # Create one sentence tokenizer for re-use as required
        self.myTokenizer = SentenceTokenizer()

        # Create an empty (local cache) frequency distribution
        self.myFD = FreqDist()

        # Create the master (disk-based) frequency table
        self.myDB = FreqTableDB(fname, True)

        # regex for punctuation (breaks N-grams)
        self.regexPunct = re.compile(r'^[\W_]+$')
        # regex for numbers - includes partial word/numbers
        self.regexNum = re.compile(r'\d+')
        self.regexASCIIPrintable = re.compile(r'^[\x20-\x7E]+$')
        self.regexStripL = re.compile(r'^[\W_]+')
        self.regexStripR = re.compile(r'[\W_]+$')
        self.regexAllUnderscores = re.compile(r'^_+$')

    def close(self):
        self.myDB.flush()
        self.myDB.close()

    # Accessors
    def DB(self):
        return self.myDB

    # Utility functions

    # Strip any spaces or punctuation prefixes/suffixes
    # Note: If word is JUST punctuation, then it is passed through as-is,
    # except all-underscore tokens, which are converted to an empty string
    def strip_word(self, word):
        wd = word.strip()
        if (self.regexAllUnderscores.match(wd)):
            return ''       # empty string - ignore all underscores
        if (self.regexPunct.match(wd) ):
            return wd       # pure punctuation - no change
        wd = self.regexStripL.sub( '', wd )
        wd = self.regexStripR.sub( '', wd )
        return wd

    # Used by buildTableForFile() to process a section of text for an N-gram
    # Note: punctuation always breaks an N-gram
    # Text is assumed to be a paragraph
    #   text: Text to process
    #   incl_num, incl_punct: Should numbers or punctuation be included?
    # If N>1, sentence start/end is recorded as '<s>' and '</s>'
    # If N=1, <s> and </s> are skipped
    def processText(self, text, incl_num, incl_punct):
        # Segment the text into words and sentences
        # Only words are required, but sentence segmentation is involved
        # because we want to interpret full stops correctly
        sentences = self.myTokenizer.segment_text(text)

        for sentence in sentences:
            n_gram = ['<s>']
            for word in sentence:
                wd = self.strip_word(word.lower())
                if (len(wd) == 0):
                    continue
                if (not self.regexASCIIPrintable.match(wd)):
                    # unrecognized word - reset the ngram
                    n_gram = [ ]
                    continue
                if (self.regexNum.match(wd) ):
                    # numeric
                    if (not incl_num):
                        n_gram = [ ]    # reset ngram
                        wd = ""
                        continue
                elif (self.regexPunct.match(wd) ):
                    # punctuation
                    if (not incl_punct):
                        # skip punctuation
                        n_gram = [ ]    # reset ngram
                        continue

                # if okay, add word (or symbol)
                n_gram.append( wd )

                # Shrink N-gram if it has grown too big
                if (len(n_gram) > self.N):
                    n_gram.pop(0)

                # save if big enough
                if (len(n_gram) == self.N):
                    self.myFD.inc( " ".join(n_gram) )

            # sentence finish - write a sentence marker if N>1
            if (self.N > 1):
                n_gram.append( '</s>' )
                if (len(n_gram) > self.N):
                    n_gram.pop(0)
                if (len(n_gram) == self.N):
                    self.myFD.inc( " ".join(n_gram) )

    # Add the words for a file to this frequency table
    #   fname: Full path file name to read (plain text only)
    #   incl_num, incl_punct: True if you wish to include numbers or
    #       symbol/punctuation tokens
    # Note: This distribution is accumulative. Create a new class
    # to reset the table
    def buildTableForFile(self, fname, incl_num, incl_punct):
        # Read the text as one big string
        f = open(fname, "r")
        lines_text = f.readlines()
        f.close()

        # Process text in paragraphs, using empty lines as paragraph markers
        # This avoids inefficient text processing and re-allocations
        full_text = ""
        for s in lines_text:
            ss = s.strip()
            if (len(ss) == 0 and len(full_text) > 0):
                # Empty line => process what we have
                self.processText(full_text, incl_num, incl_punct)
                full_text = ""
            else:
                # Accumulate this line
                full_text = full_text + " " + ss

        # Process any remaining text
        if (len(full_text) > 0):
            self.processText(full_text, incl_num, incl_punct)

    # Add the words for all files in the supplied directory to this
    # frequency table. The directory is recursed if necessary.
    # All files should be plain text.
    #   path: Full path to the directory to read
    #   incl_num, incl_punct: True if you wish to include numbers or
    #       symbol/punctuation tokens
    # Note: This distribution is accumulative. Create a new class
    # to reset the table
    def buildTableForTextDir(self, path, incl_num, incl_punct):
        counter = 0
        for dirname, dirnames, filenames in os.walk(path):
            for f in filenames:
                infile = os.path.join(dirname, f)
                self.buildTableForFile(infile, incl_num, incl_punct)

                counter = counter + 1
                if ( (counter % 100) == 0):
                    gc.collect()    # aggressively garbage collect
                    print counter

                if ( (counter % 1000) == 0):
                    # Copy the local FreqDist to the master database
                    print " (saving local table)"
                    for bigram in self.myFD:
                        bb = bigram.strip()
                        if (len(bb) > 0):
                            self.myDB.increment(bb, self.myFD[bigram] )
                    print " (flushing changes)"
                    self.myDB.flush()
                    self.myFD = FreqDist()
                    gc.collect()
                    print counter, " No. of keys:", self.myDB.dbTable.stat()["nkeys"]

        # All files finished - finalize any remaining in-memory data
        # Copy the local FreqDist to the master database
        print " (saving local table)"
        for bigram in self.myFD:
            bb = bigram.strip()
            if (len(bb) > 0):
                self.myDB.increment(bb, self.myFD[bigram] )
        print " (flushing changes)"
        self.myDB.flush()
        self.myFD = FreqDist()
        gc.collect()
        print counter, "Final: No. of keys:", self.myDB.dbTable.stat()["nkeys"]
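To make the sentence markers concrete, here is a small hedged example (the file name is illustrative, and the exact tokens depend on how SentenceTokenizer splits the text):

# Expected bigrams for a single short sentence, with punctuation included
builder = WordFrequencyBuilder(2, "demo_bigrams.bdb")
builder.processText("The cat sat.", False, True)    # incl_num=False, incl_punct=True
for ngram in builder.myFD:
    print ngram, builder.myFD[ngram]
# Assuming the sentence tokenizes to: the, cat, sat, .
# the local FreqDist should contain "<s> the", "the cat", "cat sat", "sat ." and ". </s>"
builder.close()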
Finally, we define the script's main entry point so that it can be run from the command line. This also serves as a usage example:
# Main script - create an N-Gram frequency table for the supplied path
# Usage: python build_ngram_db.py N /my/input/path /my/output/table.bdb
#   N = Size of ngram
if __name__ == '__main__':

    if (len(sys.argv) != 4):
        sys.stderr.write("Usage: python %s N inputpath outputfile\n" % sys.argv[0])
        raise SystemExit(1)

    Ngram = int(sys.argv[1])
    input_path = sys.argv[2]
    fname = sys.argv[3]
    print "NGram size: ", Ngram

    # Remove any existing table
    # Comment this out if you wish to append to an existing file
    if (os.path.exists(fname)):
        os.remove(fname)

    print "Scanning for NGrams..."
    myNGramWF = WordFrequencyBuilder(Ngram, fname)
    myNGramWF.buildTableForTextDir( input_path, False, True)
    myDB = myNGramWF.DB()

    # Display the first 200 entries as a simple demo
    cursor = myDB.cursor()
    rec = cursor.first()
    cc = 0
    while rec and (cc < 200):
        print ">" + myDB.cursor_key(rec) + "<->" + str(myDB.cursor_value(rec))
        rec = cursor.next()
        cc = cc + 1

    nTotal = myDB.calculate_total()
    print "Total samples=", nTotal

    myNGramWF.close()
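For example, assuming the corpus has been extracted into plain-text files under a (hypothetical) directory /data/corpus_text, a bigram table could be built with:

python build_ngram_db.py 2 /data/corpus_text bigrams.bdb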
And that is it! This script will still take days to process a large dataset such as the Wikipedia pages, but it will do so without running out of memory. If memory does become a problem, restrict the BerkeleyDB cache further and write the NLTK FreqDist cache out to the database more frequently (e.g. every 200 input files instead of every 1000).
The resulting BerkeleyDB databases are too large to make available as a simple download. Instead, I shall be making them available as an Azure web service in the next few weeks. Availability and demo code will be published here.