LF_TrainChunking Processing Resource
The chunking training PR allows you to train a model for finding chunks in text, such as entity names or verb phrases. The PR has no init-time parameters.
While the task of classification learning is to find the correct class label for an instance annotation, for chunk learning we want to find where an annotation should be placed. For example, we want to learn where to place “Person” annotations. The learning algorithm thus needs to know the annotation type of the annotations we are interested in - the class annotation type. To build a model for this, the learning algorithm uses features derived from instance annotations, for example Token annotations, and from annotations overlapping with instance annotations (described in the feature specification file). For each instance annotation, the model can then decide if a class annotation should start, continue or end at that instance annotation.
This way of creating a model for finding chunks, or class annotations, can be carried out by two kinds of learning algorithms: sequence tagging algorithms and conventional classification algorithms. Sequence tagging algorithms use a whole sequence of instance annotations to decide whether a class annotation should start, continue or end at each instance annotation, also considering dependencies between those instance annotations, while classification algorithms look at each instance annotation separately. Because sequence tagging algorithms need to consider a whole sequence of annotations, they need to know about “sequence annotations” in addition to instance annotations and class annotations: the sequence annotations cover the sequence of instance annotations the tagging algorithm will use as a unit, for example, a Sentence or Paragraph.
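The start/continue/end decisions described above are commonly encoded as per-token BIO labels. The following sketch is illustrative only (it is not code from the Learning Framework): it shows how class annotations over a sequence of instance annotations reduce to one B/I/O label per token, using (start, end) character offsets.

```python
# Illustration: reduce chunk (class) annotations over instance
# annotations to per-token BIO labels. A token gets "B-<label>" if a
# chunk starts at its start offset, "I-<label>" if it lies inside a
# chunk that started earlier, and "O" otherwise.

def bio_labels(tokens, chunks, label):
    out = []
    for (ts, te) in tokens:
        tag = "O"
        for (cs, ce) in chunks:
            if ts == cs:
                tag = "B-" + label          # chunk starts here
            elif cs < ts and te <= ce:
                tag = "I-" + label          # token continues the chunk
        out.append(tag)
    return out

# "Barack Obama spoke": three tokens, one Person chunk over offsets 0-12
tokens = [(0, 6), (7, 12), (13, 18)]
chunks = [(0, 12)]
print(bio_labels(tokens, chunks, "Person"))  # ['B-Person', 'I-Person', 'O']
```

A sequence tagging algorithm predicts this label sequence jointly over the whole sequence span, while a classification algorithm predicts each token's label independently.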
Runtime parameters
algorithmParameters
(String, no default) parameters influencing the training algorithm (see below)
classAnnotationType
the annotation type of the annotations which the model should learn to find, e.g. “Person”.
dataDirectory
(URL, no default, required) the directory where to save all the files generated by the algorithm (model file, dataset description file, information file etc.). The file names are always the same, so a different directory MUST be used to keep models separate.
featureSpecURL
(URL, no default, required) the XML file describing the features to use, see FeatureSpecification
inputASName
(String, default is the empty String for the default annotation set) input annotation set containing the instance annotations, the annotations specified in the feature specification and the sequenceSpan annotations, if used.
instanceType
(String, default “Token”, required) the annotation type of instance annotations.
scaleFeatures
(Enumeration, default NONE) how to scale features, if at all. Possible values:
NONE
do not do any scaling at all
MEANVARIANCE_ALL_FEATURES
normalize all features to have mean 0 and variance 1. [NOTE: this is not implemented properly yet and may change in the future!] See FeatureScaling
sequenceSpan
(String, no default) this must be used for sequence tagging algorithms only! For such algorithms, it specifies the span across which to learn a sequence; for example, a sentence is a meaningful sequence of words. If used like this, a sequence algorithm can be used for classification, although this is not normally what one wants to do.
targetFeature
(String, no default, required) the feature on the instance annotation that contains the nominal value which represents the class label. All instance annotations should have a class label.
trainingAlgorithm
the training algorithm to use. See below for details.
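As an illustration only (the paths and values below are hypothetical), a chunking training run for “Person” annotations with a sequence tagging algorithm might use runtime parameter settings like the following:

```
classAnnotationType = Person
dataDirectory       = file:/home/user/person-model/
featureSpecURL      = file:/home/user/person-features.xml
inputASName         =                  (empty: default annotation set)
instanceType        = Token
sequenceSpan        = Sentence         (needed only for sequence tagging algorithms)
trainingAlgorithm   = MalletCRF_SEQ_MR
```

Note that sequenceSpan would be left unset for a conventional classification algorithm.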
Training a model
Features used in chunking problems tend to be a rich array of information about the token in question, helping to answer the question: is this token inside a chunk or outside of one? (More specifically, the Learning Framework implements chunking using the BIO approach: is this token a beginning, an inside or an outside?) The simple example below uses a variety of features from the token, but it is equally acceptable to use features from other co-located annotation types.
<ML-CONFIG>
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>category</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>kind</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>length</FEATURE>
<DATATYPE>numeric</DATATYPE>
</ATTRIBUTE>
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>orth</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>
<ATTRIBUTE>
<TYPE>Token</TYPE>
<FEATURE>string</FEATURE>
<DATATYPE>nominal</DATATYPE>
</ATTRIBUTE>
</ML-CONFIG>
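To see what such a specification amounts to per instance, here is a small sketch (the function name and representation are illustrative, not the Learning Framework's internals) of how one Token's features could be turned into a feature vector according to the ATTRIBUTE entries above: nominal features become name=value indicator features, numeric features keep their value.

```python
# Illustration: map one Token annotation's features to a sparse
# feature vector, following the DATATYPE of each ATTRIBUTE entry.

def token_features(ann_features, spec):
    vec = {}
    for name, datatype in spec:
        value = ann_features.get(name)
        if value is None:
            continue  # missing feature: contribute nothing
        if datatype == "nominal":
            vec["Token.%s=%s" % (name, value)] = 1.0  # indicator feature
        else:  # numeric
            vec["Token.%s" % name] = float(value)
    return vec

spec = [("category", "nominal"), ("kind", "nominal"),
        ("length", "numeric"), ("orth", "nominal"), ("string", "nominal")]
print(token_features(
    {"category": "NNP", "kind": "word", "length": 6,
     "orth": "upperInitial", "string": "Barack"}, spec))
```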
Algorithms and their Parameters
Setting up the Transducer and training a sequence tagging model is more complex than training classification algorithms. The Mallet API offers a lot of flexibility in how to do this, while the LearningFramework only provides a subset of this flexibility through the algorithm parameters. To take full advantage of the Mallet API, it is possible to use it directly from Java code, e.g. by using the Java Plugin. This is described in more detail in API.
MalletCRF_SEQ_MR
This uses the algorithm cc.mallet.fst.CRFTrainerByThreadedLabelLikelihood or cc.mallet.fst.CRFTrainerByLabelLikelihood (see http://mallet.cs.umass.edu/api/cc/mallet/fst/CRFTrainerByLabelLikelihood.html and http://mallet.cs.umass.edu/api/cc/mallet/fst/CRFTrainerByThreadedLabelLikelihood.html) and the cc.mallet.fst.CRF class to represent the CRF.
Parameters:
-threads / -t
(Integer, default: not specified): if this is specified, the CRFTrainerByThreadedLabelLikelihood is used with the given number of threads, otherwise the CRFTrainerByLabelLikelihood is used.
-states / -S
(String, default fully-connected): this defines how the CRF states get initialized; one of the following values can be specified:
fully-connected
as-in
fully-threequarter
half
order-n: specify the order of the CRF. This will initialize the connection weights from the full training set; the order and connections can be specified using the -ofully and -orders parameters.
-ofully / -f
(boolean): only relevant if order-n is specified for -states. Initialize the CRF with all connections, not just the ones seen in the training set.
-orders / -o
(String, default is “0:1”): only relevant if order-n is specified for -states. Specify the order of the features for the CRF. This must be a non-empty list of increasing non-negative numbers, separated by colons. Currently only the following lists are supported: 0, 0:1, 0:1:2, 1, 1:2, 2. The highest number in the list specifies the Markov order of the CRF.
-addstart / -a
(boolean, default: false) if specified, will add an explicit start state to the CRF.
-logViterbiPaths / -v
(Integer, default: MAX_INT) after the number of optimization iterations specified, the Viterbi paths will be written to a file LF_debug..viterbi
-useSparseWeights / -usw
-setSomeUnsupportedTrick / -ssut
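As an illustrative combination (not a recommendation), the algorithmParameters string for this algorithm could combine the flags documented above like this:

```
-t 4 -S order-n -o 0:1 -a
```

This would train with four threads (using CRFTrainerByThreadedLabelLikelihood), initialize the CRF states in order-n mode with feature orders 0 and 1, and add an explicit start state.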
MalletCRFSG_SEQ_MR
MalletCRFVG_SEQ_MR
MalletMEMM_SEQ_MR
PytorchWrapper_CL_DR and PytorchWrapper_SEQ_DR
This uses the Python-based wrapper for the PyTorch backend. In order to use this, the Python environment must first be prepared on your machine. See the documentation for the PyTorch backend for more.
KerasWrapper_CL_DR and KerasWrapper_SEQ_DR
This uses the Python-based wrapper for the Keras/Tensorflow backend. In order to use this, the Python environment must first be prepared on your machine. See the documentation for the Keras backend for more.