LogonModeling

Overview

This page contains various code examples showing how to estimate and apply statistical models within LOGON. For more detailed information on feature types, estimation parameters, or the experimentation environment in general, see [http://www.velldal.net/erik/pubs/Velldal08.pdf Velldal (2008)].

Discriminative Modeling

In the following, we assume that 'generation treebanks' for the LOGON JHPSTG and Rondane corpora are available. For the HandOn release of the LOGON system (November 2008), these treebanks can be installed from SVN into the lingo/redwoods/ directory; see the LogonExtras page for instructions on how to install add-on LOGON components. In principle, however, these instructions should be applicable to other Redwoods-style treebanks as well.

The examples below assume that the itsdb database root points to the collection of LOGON treebanks, i.e. the directory lingo/redwoods/tsdb/home/, which is one of the SVN add-on components (see the LogonExtras page). We further assume that the complete LOGON system and the appropriate grammar (in our case, the ERG from lingo/redwoods/erg/) are already loaded.

Set the feature parameters. The system defaults correspond to:

  (let ((*feature-grandparenting* 4)          ; levels of dominance context
        (*feature-active-edges-p* t)          ; features from active edges
        (*feature-ngram-size* 4)              ; maximum lexical n-gram length
        (*feature-ngram-back-off-p* t)        ; include back-off n-grams
        (*feature-ngram-tag* :type)           ; tag n-grams with lexical types
        (*feature-use-preterminal-types-p* t)
        (*feature-lexicalization-p* t)
        (*feature-constituent-weight* 2)
        (*feature-lm-p* 10)                   ; LM score, scaled down by 10
        (*feature-frequency-threshold* nil))  ; no frequency cut-off
    ...)
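Note that these are ordinary (special) variables, so non-default values are put into effect by rebinding them around a specific call, as in the training example further below. A minimal sketch, rebinding two of the parameters around a ranking experiment (the profiles are those of the cross-validation example below, which also shows how the test profile is created):

  (let ((*feature-grandparenting* 3)
        (*feature-ngram-size* 3))
    (rank-profile "jhpstg.g" "jhpstg.t" :nfold 5))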

Create a feature cache for the (virtual) profile jhpstg.g (we typically use the .g suffix for generation treebanks):

  (setq gold "jhpstg.g")
  (operate-on-profiles (list gold) :task :fc)

Intended as a one-time operation, feature caching extracts all features from the treebank and stores them in a (Berkeley DB) database within the respective profile directory (in a file named fc.bdb). When running experiments later, features are then simply looked up in this database, saving the cost of re-extraction. A symbol table named fc.mlm (also created within the jhpstg.g profile, for the example above) records the mapping from symbolic feature representations to the numerical indices used for model estimation and DB storage. The symbol table is only referenced when exporting a model or applying it to new data (see the examples below), but it can also be useful to inspect manually, e.g. to confirm that features have the correct form, plausible counts, plausible value ranges, etc.
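For such manual inspection, one can simply page through the symbol table; the following is a minimal sketch, assuming the working directory is the database root (lingo/redwoods/tsdb/home/) and that fc.mlm is stored as plain text:

  ;; print the first ten entries of the feature symbol table
  (with-open-file (stream "jhpstg.g/fc.mlm")
    (loop repeat 10
          for line = (read-line stream nil)
          while line
          do (format t "~a~%" line)))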

Example of how to run a single experiment using 5-fold cross-validation:

  (setq test "jhpstg.t")
  (tsdb :create test :skeleton "jhpstg")
  (rank-profile gold test :nfold 5)

Running a batch of 10-fold MaxEnt experiments on jhpstg.g, iterating over different configurations of features and estimation parameters (the top-level function batch-experiment() performs an exhaustive 'grid search' over all combinations of the specified parameter values):

  (batch-experiment
   :type :mem
   :variance '(nil 1000 100 10 1 1.0e-1 1.0e-2)
   :absolute-tolerance 1.0e-10
   :source "jhpstg.g"
   :skeleton "jhpstg"
   :random-sample-size nil
   :ngram-size '(0 1 2 3)
   :active-edges-p nil
   :grandparenting '(0 1 2 3)
   :lm-p 10
   :counts-relevant 1
   :nfold 10
   :compact nil)

The following gives a brief explanation of the various keyword arguments. The :variance parameter governs the Gaussian prior on feature weights; :absolute-tolerance governs the convergence threshold of the estimation. Specifying a non-nil (integer) value n for :random-sample-size means that only a random selection of (at most) n non-preferred candidates per item is included in the training data. The parameter :counts-relevant governs a frequency-based cut-off on feature values. The keywords :ngram-size, :active-edges-p, and :grandparenting allow iteration over feature parameters; with the values above, the grid comprises 7 × 4 × 4 = 112 configurations (variance settings × n-gram sizes × grandparenting levels), each evaluated by 10-fold cross-validation.

Note that specifying :lm-p 10 means that the value of the language model feature is divided by 10; this is basically a hack to avoid numerical problems during estimation. To leave out the LM feature altogether, call with :lm-p nil instead. Specifying :type :mem means that we are training a conditional maximum entropy model (aka a log-linear model); the value of :type could also be :svm, provided SVMlight is installed (it is currently not part of the LOGON distribution). Finally, the boolean-valued :compact governs the naming convention when creating target profiles: if the profile names for the 10-fold cross-validation experiments look excessively long (or even cause issues with OS-imposed limits on the total length of pathnames), try t as the :compact value.
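As an illustration of the last two points, a variant of the batch above that leaves out the LM feature, explores a smaller variance grid, and requests compact profile names might look as follows (a sketch only; all keywords are as documented above):

  (batch-experiment
   :type :mem
   :variance '(nil 10 1.0e-2)
   :source "jhpstg.g"
   :skeleton "jhpstg"
   :ngram-size '(0 1 2 3)
   :grandparenting '(0 1 2 3)
   :lm-p nil             ; leave out the language model feature
   :counts-relevant 1
   :nfold 10
   :compact t)           ; shorter names for the target profiles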

Example of how to estimate and export a MaxEnt model:

  (let ((*feature-grandparenting* 3)
        (*feature-ngram-size* 3)
        (*feature-lm-p* nil)
        (*maxent-variance* 8e-4)
        (*feature-frequency-threshold* (make-counts :relevant 1)))
    (train "jhpstg.g" "jhpstg.g.mem" :fcp nil :type :mem))

This writes the estimated model to jhpstg.g.mem (for more information on the format, see Chapter 6 of [http://www.velldal.net/erik/pubs/Velldal08.pdf Velldal (2008)]). The keyword argument :fcp nil means that we do not want to (re-)create a feature cache, but rather reuse the one created above.

Applying the model trained above to the generation treebank rondane.g:

  (tsdb :create "rondane.t" :skeleton "rondane")

  (operate-on-profiles
    (list "rondane.g") :model (read-model "jhpstg.m.mem")
    :target "rondane.t" :task :rank)

Automating Experiments

Once the LOGON ERG add-on treebanks are installed (see the LogonExtras page), there are several Lisp parameter files and a shell script (called lingo/redwoods/load) to run the steps above from the command line. For example, the creation of a feature cache (on the default generation treebank jhpstg.g) can be automated as follows:

  cd $LOGONROOT/lingo/redwoods
  ./load --binary fc.g.lisp

The parameter file grid.g.lisp provides the default setup for an exhaustive grid search for the best-performing combinations of features and meta-parameters. Once optimal parameter values are identified, the file train.g.lisp automates the training and serialization of a MaxEnt model file, by default called jhpstg.g.mem in this case. All of the Lisp files are intended for use with the load script, analogous to the example call given above (see the sketch below). Even on adequate hardware (we recommend a 64-bit Linux environment with at least eight gigabytes of available RAM), each of these steps can take substantial time, i.e. between several hours and, for the grid search, many days, depending on how many parameter variations are explored.
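By analogy to the feature cache call above, the grid search and final training steps would then be invoked like this (assuming the same working directory and the default parameter files):

  cd $LOGONROOT/lingo/redwoods
  ./load --binary grid.g.lisp
  ./load --binary train.g.lisp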

Note that there are 'families' of Lisp parameter files in lingo/redwoods/: one for parse ranking (fc.lisp, grid.lisp, and train.lisp), one for realization ranking (fc.g.lisp, grid.g.lisp, and train.g.lisp, as used in our examples above), and another one for end-to-end MT re-ranking (fc.r.lisp, grid.r.lisp, and train.r.lisp). For each task, there is an additional file defining the default environment, called parsing.lisp, generation.lisp, and ranking.lisp, respectively. Furthermore, the itsdb configuration for automated MaxEnt experimentation is determined by the file dot.tsdbrc in lingo/redwoods/.

These files are distributed primarily to serve as examples for similar experimentation. To vary the nature of an experiment (e.g. using different treebanks, another grammar or MT configuration, or additional feature types), it may be necessary to adapt dot.tsdbrc and the Lisp configuration files (or even the load script) accordingly. Where possible, we recommend copying the existing files to create a new 'family' of parameter settings and tasks, as sketched below.
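For example, a new family for experiments on another generation treebank (the .x suffix here is purely hypothetical) might be started by copying the existing realization ranking files and then editing the copies to point to the new profiles and skeletons:

  cd $LOGONROOT/lingo/redwoods
  cp fc.g.lisp fc.x.lisp
  cp grid.g.lisp grid.x.lisp
  cp train.g.lisp train.x.lisp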
