LF_TrainChunking Processing Resource

The chunking training PR allows you to train a model for finding chunks in text, such as entity names or verb phrases. The PR has no init-time parameters.

While the task of classification learning is to find the correct class label for an instance annotation, in chunk learning we want to find where an annotation should be placed. For example, we may want to learn where to place “Person” annotations. The learning algorithm therefore needs to know the annotation type of the annotations we are interested in: the class annotation type. To build a model for this, the learning algorithm uses features derived from the instance annotations (for example, Token annotations) and from annotations overlapping with the instance annotations, as described in the feature specification file. For each instance annotation, the model can then decide whether a class annotation should start, continue or end at that instance annotation.

This way of creating a model for finding chunks, or class annotations, can be carried out by two kinds of learning algorithms: sequence tagging algorithms and conventional classification algorithms. Sequence tagging algorithms use a whole sequence of instance annotations to decide whether a class annotation should start, continue or end at each instance annotation, also taking into account the dependencies between those instance annotations, while classification algorithms look at each instance annotation separately. Because sequence tagging algorithms need to consider a whole sequence of annotations, they need to know about “sequence annotations” in addition to instance annotations and class annotations: a sequence annotation covers the sequence of instance annotations the tagging algorithm will use as a unit, for example a Sentence or Paragraph.

Runtime parameters

Training a model

Features used in chunking problems tend to be a rich array of information about the token in question, helping to answer the question: is this token in a chunk or outside of one? (More specifically, the Learning Framework implements chunking using the BIO approach: is this token at the beginning of a chunk, inside one, or outside?) The simple example below uses a variety of features from the token, but it is equally acceptable to use features from other co-located annotation types.

<ML-CONFIG>

<!-- Part-of-speech tag of the token -->
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>category</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>

<!-- Kind of token, e.g. word, number or punctuation -->
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>kind</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>

<!-- Length of the token in characters -->
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>length</FEATURE>
<DATATYPE>numeric</DATATYPE>
</ATTRIBUTE>

<!-- Orthography of the token, e.g. upperInitial or allCaps -->
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>orth</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>

<!-- The token string itself -->
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>string</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>

</ML-CONFIG>
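
To make the BIO encoding concrete, here is a minimal sketch of how begin/inside/outside labels could be derived for each Token from the spans of Person class annotations. The class and method names here are hypothetical and purely illustrative; the Learning Framework performs this encoding internally.

import java.util.List;

public class BioEncodingSketch {

    // A simple character span; a hypothetical helper, not a GATE/LF class.
    static class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    // Label each token "B-Person" where a Person annotation starts,
    // "I-Person" where one continues, and "O" everywhere else.
    static String[] bioLabels(Span[] tokens, List<Span> persons) {
        String[] labels = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            labels[i] = "O"; // outside by default
            for (Span p : persons) {
                if (tokens[i].start == p.start) {
                    labels[i] = "B-Person"; // a chunk begins at this token
                } else if (tokens[i].start > p.start && tokens[i].end <= p.end) {
                    labels[i] = "I-Person"; // a chunk continues at this token
                }
            }
        }
        return labels;
    }
}

For the text “John Smith arrived”, the three tokens would be labelled B-Person, I-Person and O.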

Algorithms and their Parameters

Setting up the Transducer and training a sequence tagging model is more complex than training classification algorithms. The Mallet API offers a lot of flexibility in how to do this, while the LearningFramework only exposes a subset of this flexibility through the algorithm parameters. To take full advantage of the Mallet API, it is possible to use it directly from Java code, e.g. by using the Java Plugin. This is described in more detail in the API documentation.
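
As a minimal sketch of such direct use, the following trains a CRF with the Mallet API, assuming the training data has already been converted into a Mallet InstanceList; the parameter values shown are illustrative, not the Learning Framework's defaults.

import cc.mallet.fst.CRF;
import cc.mallet.fst.CRFTrainerByLabelLikelihood;
import cc.mallet.types.InstanceList;

public class DirectMalletCrfSketch {
    public static CRF train(InstanceList trainingData) {
        // Build a CRF whose alphabets come from the pipe that created the data.
        CRF crf = new CRF(trainingData.getPipe(), null);
        // Use a fully connected state machine over the label alphabet.
        crf.addFullyConnectedStatesForLabels();
        CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
        trainer.setGaussianPriorVariance(10.0); // illustrative regularisation value
        trainer.train(trainingData, 100);       // at most 100 training iterations
        return crf;
    }
}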

MalletCRF_SEQ_MR

This uses the algorithms cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood and cc.mallet.fst.CRFTrainerByLabelLikelihood (see http://mallet.cs.umass.edu/api/cc/mallet/fst/CRFTrainerByLabelLikelihood.html and http://mallet.cs.umass.edu/api/cc/mallet/fst/CRFTrainerByThreadedLabelLikelihood.html) and the cc.mallet.fst.CRF class to represent the CRF.
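
As an illustrative sketch (not the plugin's actual code), the threaded trainer is used in much the same way as the single-threaded one shown above, except that it takes a thread count and must be shut down after training:

import cc.mallet.fst.CRF;
import cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood;
import cc.mallet.types.InstanceList;

public class ThreadedCrfSketch {
    public static CRF train(InstanceList trainingData, int numThreads) {
        CRF crf = new CRF(trainingData.getPipe(), null);
        crf.addFullyConnectedStatesForLabels();
        CRFTrainerByThreadedLabelLikelihood trainer =
                new CRFTrainerByThreadedLabelLikelihood(crf, numThreads);
        trainer.train(trainingData); // train until convergence
        trainer.shutdown();          // stop the worker thread pool
        return crf;
    }
}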

Parameters:

MalletCRFSG_SEQ_MR

MalletCRFVG_SEQ_MR

MalletMEMM_SEQ_MR

PytorchWrapper_CL_DR and PytorchWrapper_SEQ_DR

This uses the Python-based wrapper for the PyTorch back-end. In order to use this, the Python environment must first be prepared on your machine. See the documentation for the PyTorch back-end for more information.

KerasWrapper_CL_DR and KerasWrapper_SEQ_DR

This uses the Python-based wrapper for the Keras/TensorFlow back-end. In order to use this, the Python environment must first be prepared on your machine. See the documentation for the Keras back-end for more information.